Methods and systems for intelligent sampling of normal and erroneous application traces

ABSTRACT

Computer-implemented methods and systems described herein perform intelligent sampling of application traces generated by an application. Computer-implemented methods and systems determine different sampling rates based on frequency of occurrence of normal traces and erroneous traces of the application. The sampling rates for low frequency normal and erroneous traces are larger than the sampling rates for high frequency normal and erroneous traces. The relatively larger sampling rates for low frequency traces ensure that low frequency traces are sampled in sufficient numbers and are not passed over during sampling of the application traces. The sampled normal and erroneous traces are stored in a data storage device.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 63/155,349, filed Mar. 3, 2021.

TECHNICAL FIELD

This disclosure is directed to automated methods and systems for intelligent sampling of application traces.

BACKGROUND

Electronic computing has evolved from primitive, vacuum-tube-based computer systems, initially developed during the 1940s, to modern electronic computing systems in which large numbers of multi-processor computer systems, such as server computers, work stations, and other individual computing systems are networked together with large-capacity data storage devices and other electronic devices to produce geographically distributed computing systems with hundreds of thousands, millions, or more components that provide enormous computational bandwidths and data-storage capacities. These large, distributed computing systems include data centers and are made possible by advances in computer networking, distributed operating systems and applications, data-storage appliances, computer hardware, and software technologies. The number and size of data centers have continued to grow to meet the increasing demand for information technology (“IT”) services, such as running applications for organizations that provide business services, web services, and other cloud services to millions of customers each day.

Management tools have been developed to collect traces of applications and aid system administrators and application owners with detecting performance problems with applications executed in distributed computing systems. An application trace, or simply a “trace,” is a representation of a workflow executed by an application, such as the workflow of application components of a distributed application. Application owners analyze application traces to detect performance problems with their applications. For example, a distributed application may have multiple application components executed in VMs or containers on one or more hosts of a data center. The application traces are stored and used by administrators and application developers to troubleshoot performance problems and perform root cause analysis.

Storage of application traces for a plurality of applications executing in a distributed computing environment over time creates an increasing demand for available data storage space. For example, a typical distributed application that serves hundreds of thousands of clients each day generates hundreds of thousands of corresponding application traces that are stored in data storage devices each day. For application owners, storing an enormous number of application traces increases the costs of operation. In addition, application traces that reveal performance problems associated with execution of an application, called erroneous traces, often occur with far lower frequencies than normal application traces that indicate normal execution of an application. As a result, system administrators and application developers sift through millions of application traces to identify the small number of erroneous traces, which is expensive and time-consuming. Typical management tools employ sampling procedures that sample and store a fraction of the application traces in an effort to reduce the storage space occupied by application traces and reduce the amount of time and cost associated with identifying erroneous traces. However, these sampling procedures fail to distinguish between the different types of traces. As a result, infrequently generated erroneous traces are often missed during sampling, which makes troubleshooting a performance problem a more challenging task. One approach is to store all erroneous traces. However, in certain situations the number of erroneous traces far exceeds the number of normal application traces, which eventually leads to the same problem of not having enough storage space available for normal and erroneous traces.
Application owners and system administrators seek computer-implemented methods and systems that, in general, reduce the number of stored application traces, do not undersample or miss low frequency erroneous traces, and reduce the number of stored erroneous traces when erroneous traces outnumber normal application traces.

SUMMARY

Computer-implemented methods and systems described herein perform intelligent sampling of normal and erroneous traces of an application. A set of trace data associated with the application is retrieved from a data storage device. The trace data may be stored in a trace database or temporarily stored in a buffer. Computer-implemented methods and systems determine sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set. The different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces. The sampling rates are used to obtain sampled normal traces and sampled erroneous traces. The sampling rates ensure that less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and that less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces. The sampled traces are stored in a data storage device.
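
The inverse relationship between sampling rate and frequency of occurrence can be illustrated with a minimal Python sketch. The sketch is not the procedure of the disclosure. The function names, the trace records, and the normalization (the rarest group is sampled at a hypothetical `max_rate`, and every other group at a rate scaled down by the ratio of the rarest count to its own count) are assumptions chosen only to make the proportionality concrete.

```python
import random
from collections import Counter


def inverse_frequency_rates(group_keys, max_rate=1.0):
    """Return a sampling rate per group, inversely proportional to its count.

    Assumption (not from the disclosure): the rarest group is sampled at
    max_rate and every other group at max_rate * (rarest count / its count).
    """
    counts = Counter(group_keys)
    rarest = min(counts.values())
    return {key: max_rate * rarest / count for key, count in counts.items()}


def sample_traces(traces, key_of, rates, seed=0):
    """Keep each trace with the probability assigned to its group."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rates[key_of(t)]]


# Hypothetical trace records: (status, trace type or error code).
traces = (
    [("normal", "checkout")] * 1000
    + [("normal", "refund")] * 100
    + [("erroneous", "HTTP 500")] * 10
)
rates = inverse_frequency_rates(traces)
sampled = sample_traces(traces, key_of=lambda t: t, rates=rates)
```

Under these assumptions, the high frequency normal traces are sampled at a rate of 0.01, the low frequency normal traces at 0.1, and the rare erroneous traces at 1.0, so none of the ten erroneous traces is passed over while the stored total shrinks substantially.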

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an architectural diagram for various types of computers.

FIG. 2 shows an Internet-connected distributed computer system.

FIG. 3 shows cloud computing.

FIG. 4 shows generalized hardware and software components of a general-purpose computer system.

FIGS. 5A-5B show two types of virtual machine (“VM”) and VM execution environments.

FIG. 6 shows an example of an open virtualization format package.

FIG. 7 shows example virtual data centers provided as an abstraction of underlying physical-data-center hardware components.

FIG. 8 shows virtual-machine components of a virtual-data-center management server and physical servers of a physical data center.

FIG. 9 shows a cloud-director level of abstraction.

FIG. 10 shows virtual-cloud-connector nodes.

FIG. 11 shows an example server computer used to host three containers.

FIG. 12 shows an approach to implementing containers on a VM.

FIG. 13 shows an example of a virtualization layer located above a physical data center.

FIGS. 14A-14B show an example of a distributed application and an example application trace.

FIGS. 15A-15B show examples of erroneous traces for the distributed application represented in FIG. 14A.

FIG. 16A shows an example graphical-user interface (“GUI”) that enables a user to select an application and input sampling rates for sampling normal and erroneous traces of an application.

FIG. 16B shows an example of a computer system that executes machine-readable instructions for sampling traces.

FIG. 17 shows an example set of trace data generated by an application.

FIG. 18 shows an example calculation of traces sampled from a set of trace data sorted according to trace type.

FIG. 19 shows an example of erroneous traces partitioned into sets of erroneous traces based on error status codes.

FIG. 20 shows an example of a set of trace data sorted according to trace durations.

FIG. 21 shows an example of partitioning duration-sorted traces.

FIG. 22 shows an example histogram constructed from a set of trace data.

FIG. 23 shows an example of normal traces and an example of erroneous traces of the same trace type.

FIG. 24 shows an example set of trace data partitioned into normal traces and erroneous traces based on erroneous and normal status.

FIG. 25 shows an example set of erroneous traces partitioned according to different error codes.

FIGS. 26A-26D show plots of modified Gini indices versus sampling parameters.

FIG. 27 is a flow diagram illustrating an example implementation of a “method for sampling traces of an application.”

FIG. 28 is a flow diagram illustrating an example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure performed in FIG. 27.

FIG. 29 is a flow diagram illustrating an example implementation of the “determine sampling rates based on trace type and/or duration” procedure performed in FIG. 28.

FIG. 30 is a flow diagram illustrating an example implementation of the “determine hybrid-sampling rates” procedure performed in FIG. 29.

FIG. 31 is a flow diagram illustrating an example implementation of the “determine trace-type sampling rates” procedure performed in FIG. 29.

FIG. 32 is a flow diagram illustrating an example implementation of the “determine duration-sampling rates” procedure performed in FIG. 29.

FIG. 33 is a flow diagram illustrating an example implementation of the “determine normal and erroneous trace sampling rates” procedure in FIG. 28.

FIG. 34 is a flow diagram illustrating an example implementation of the “determine normal trace sampling rate” procedure in FIG. 28.

FIG. 35 is a flow diagram illustrating an example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure performed in FIG. 27.

FIG. 36 is a flow diagram illustrating an example implementation of the “sample traces using hybrid-sampling rates” procedure performed in FIG. 35.

FIG. 37 is a flow diagram illustrating an example implementation of the “sample traces using trace-type sampling rates” procedure performed in FIG. 35.

FIG. 38 is a flow diagram illustrating an example implementation of the “sample traces using duration-sampling rates” procedure performed in FIG. 35.

FIG. 39 is a flow diagram illustrating an example implementation of the “sample traces using normal and erroneous sampling rates” procedure performed in FIG. 35.

DETAILED DESCRIPTION

This disclosure presents computer-implemented methods and systems that intelligently sample application traces generated by applications running in a distributed computing system. In the first subsection, computer hardware, complex computational systems, and virtualization are described. Computer-implemented methods and systems for intelligent sampling of normal and erroneous application traces are described below in the second subsection.

Computer Hardware, Complex Computational Systems, and Virtualization

The term “abstraction” as used to describe virtualization below is not intended to mean or suggest an abstract idea or concept. Computational abstractions are tangible, physical interfaces that are implemented, ultimately, using physical computer hardware, data storage devices, and communications systems. Instead, the term “abstraction” refers, in the current discussion, to a logical level of functionality encapsulated within one or more concrete, tangible, physically-implemented computer systems with defined interfaces through which electronically-encoded data is exchanged, process execution launched, and electronic services are provided. Interfaces may include graphical and textual data displayed on physical display devices as well as computer programs and routines that control physical computer processors to carry out various tasks and operations and that are invoked through electronically implemented application programming interfaces (“APIs”) and other electronically implemented interfaces.

FIG. 1 shows a general architectural diagram for various types of computers. Computers that receive, process, and store log messages may be described by the general architectural diagram shown in FIG. 1, for example. The computer system contains one or multiple central processing units (“CPUs”) 102-105, one or more electronic memories 108 interconnected with the CPUs by a CPU/memory-subsystem bus 110 or multiple busses, a first bridge 112 that interconnects the CPU/memory-subsystem bus 110 with additional busses 114 and 116, or other types of high-speed interconnection media, including multiple, high-speed serial interconnects. These busses or serial interconnections, in turn, connect the CPUs and memory with specialized processors, such as a graphics processor 118, and with one or more additional bridges 120, which are interconnected with high-speed serial links or with multiple controllers 122-127, such as controller 127, that provide access to various different types of mass-storage devices 128, electronic displays, input devices, and other such components, subcomponents, and computational devices. It should be noted that computer-readable data storage devices include optical and electromagnetic disks, electronic memories, and other physical data storage devices.

Of course, there are many different types of computer-system architectures that differ from one another in the number of different memories, including different types of hierarchical cache memories, the number of processors and the connectivity of the processors with other system components, the number of internal communications busses and serial links, and in many other ways. However, computer systems generally execute stored programs by fetching instructions from memory and executing the instructions in one or more processors. Computer systems include general-purpose computer systems, such as personal computers (“PCs”), various types of server computers and workstations, and higher-end mainframe computers, but may also include a plethora of various types of special-purpose computing devices, including data-storage systems, communications routers, network nodes, tablet computers, and mobile telephones.

FIG. 2 shows an Internet-connected distributed computer system. As communications and networking technologies have evolved in capability and accessibility, and as the computational bandwidths, data-storage capacities, and other capabilities and capacities of various types of computer systems have steadily and rapidly increased, much of modern computing now generally involves large distributed systems and computers interconnected by local networks, wide-area networks, wireless communications, and the Internet. FIG. 2 shows a typical distributed system in which a large number of PCs 202-205, a high-end distributed mainframe system 210 with a large data-storage system 212, and a large computer center 214 with large numbers of rack-mounted server computers or blade servers are all interconnected through various communications and networking systems that together comprise the Internet 216. Such distributed computing systems provide diverse arrays of functionalities. For example, a PC user may access hundreds of millions of different web sites provided by hundreds of thousands of different web servers throughout the world and may access high-computational-bandwidth computing services from remote computer facilities for running complex computational tasks.

Until recently, computational services were generally provided by computer systems and data centers purchased, configured, managed, and maintained by service-provider organizations. For example, an e-commerce retailer generally purchased, configured, managed, and maintained a data center including numerous web server computers, back-end computer systems, and data-storage systems for serving web pages to remote customers, receiving orders through the web-page interface, processing the orders, tracking completed orders, and other myriad different tasks associated with an e-commerce enterprise.

FIG. 3 shows cloud computing. In the recently developed cloud-computing paradigm, computing cycles and data-storage facilities are provided to organizations and individuals by cloud-computing providers. In addition, larger organizations may elect to establish private cloud-computing facilities in addition to, or instead of, subscribing to computing services provided by public cloud-computing service providers. In FIG. 3, a system administrator for an organization, using a PC 302, accesses the organization's private cloud 304 through a local network 306 and private-cloud interface 308 and accesses, through the Internet 310, a public cloud 312 through a public-cloud services interface 314. The administrator can, in either the case of the private cloud 304 or public cloud 312, configure virtual computer systems and even entire virtual data centers and launch execution of application programs on the virtual computer systems and virtual data centers in order to carry out any of many different types of computational tasks. As one example, a small organization may configure and run a virtual data center within a public cloud that executes web servers to provide an e-commerce interface through the public cloud to remote customers of the organization, such as a user viewing the organization's e-commerce web pages on a remote user system 316.

Cloud-computing facilities are intended to provide computational bandwidth and data-storage services much as utility companies provide electrical power and water to consumers. Cloud computing provides enormous advantages to small organizations without the devices to purchase, manage, and maintain in-house data centers. Such organizations can dynamically add and delete virtual computer systems from their virtual data centers within public clouds in order to track computational-bandwidth and data-storage needs, rather than purchasing sufficient computer systems within a physical data center to handle peak computational-bandwidth and data-storage demands. Moreover, small organizations can completely avoid the overhead of maintaining and managing physical computer systems, including hiring and periodically retraining information-technology specialists and continuously paying for operating-system and database-management-system upgrades. Furthermore, cloud-computing interfaces allow for easy and straightforward configuration of virtual computing facilities, flexibility in the types of applications and operating systems that can be configured, and other functionalities that are useful even for owners and administrators of private cloud-computing facilities used by a single organization.

FIG. 4 shows generalized hardware and software components of a general-purpose computer system, such as a general-purpose computer system having an architecture similar to that shown in FIG. 1. The computer system 400 is often considered to include three fundamental layers: (1) a hardware layer or level 402; (2) an operating-system layer or level 404; and (3) an application-program layer or level 406. The hardware layer 402 includes one or more processors 408, system memory 410, various different types of input-output (“I/O”) devices 410 and 412, and mass-storage devices 414. Of course, the hardware level also includes many other components, including power supplies, internal communications links and busses, specialized integrated circuits, many different types of processor-controlled or microprocessor-controlled peripheral devices and controllers, and many other components. The operating system 404 interfaces to the hardware level 402 through a low-level operating system and hardware interface 416 generally comprising a set of non-privileged computer instructions 418, a set of privileged computer instructions 420, a set of non-privileged registers and memory addresses 422, and a set of privileged registers and memory addresses 424. In general, the operating system exposes non-privileged instructions, non-privileged registers, and non-privileged memory addresses 426 and a system-call interface 428 as an operating-system interface 430 to application programs 432-436 that execute within an execution environment provided to the application programs by the operating system. The operating system, alone, accesses the privileged instructions, privileged registers, and privileged memory addresses.
By reserving access to privileged instructions, privileged registers, and privileged memory addresses, the operating system can ensure that application programs and other higher-level computational entities cannot interfere with one another's execution and cannot change the overall state of the computer system in ways that could deleteriously impact system operation. The operating system includes many internal components and modules, including a scheduler 442, memory management 444, a file system 446, device drivers 448, and many other components and modules. To a certain degree, modern operating systems provide numerous levels of abstraction above the hardware level, including virtual memory, which provides to each application program and other computational entities a separate, large, linear memory-address space that is mapped by the operating system to various electronic memories and mass-storage devices. The scheduler orchestrates interleaved execution of various different application programs and higher-level computational entities, providing to each application program a virtual, stand-alone system devoted entirely to the application program. From the application program's standpoint, the application program executes continuously without concern for the need to share processor devices and other system devices with other application programs and higher-level computational entities. The device drivers abstract details of hardware-component operation, allowing application programs to employ the system-call interface for transmitting and receiving data to and from communications networks, mass-storage devices, and other I/O devices and subsystems. The file system 446 facilitates abstraction of mass-storage-device and memory devices as a high-level, easy-to-access, file-system interface.
Thus, the development and evolution of the operating system has resulted in the generation of a type of multi-faceted virtual execution environment for application programs and other higher-level computational entities.

While the execution environments provided by operating systems have proved to be an enormously successful level of abstraction within computer systems, the operating-system-provided level of abstraction is nonetheless associated with difficulties and challenges for developers and users of application programs and other higher-level computational entities. One difficulty arises from the fact that there are many different operating systems that run within various different types of computer hardware. In many cases, popular application programs and computational systems are developed to run on only a subset of the available operating systems and can therefore be executed within only a subset of the different types of computer systems on which the operating systems are designed to run. Often, even when an application program or other computational system is ported to additional operating systems, the application program or other computational system can nonetheless run more efficiently on the operating systems for which the application program or other computational system was originally targeted. Another difficulty arises from the increasingly distributed nature of computer systems. Although distributed operating systems are the subject of considerable research and development efforts, many of the popular operating systems are designed primarily for execution on a single computer system. In many cases, it is difficult to move application programs, in real time, between the different computer systems of a distributed computer system for high-availability, fault-tolerance, and load-balancing purposes. The problems are even greater in heterogeneous distributed computer systems which include different types of hardware and devices running different types of operating systems.
Operating systems continue to evolve, as a result of which certain older application programs and other computational entities may be incompatible with more recent versions of operating systems for which they are targeted, creating compatibility issues that are particularly difficult to manage in large distributed systems.

For all of these reasons, a higher level of abstraction, referred to as the “virtual machine” (“VM”), has been developed and evolved to further abstract computer hardware in order to address many difficulties and challenges associated with traditional computing systems, including the compatibility issues discussed above. FIGS. 5A-B show two types of VM and virtual-machine execution environments. FIGS. 5A-B use the same illustration conventions as used in FIG. 4. FIG. 5A shows a first type of virtualization. The computer system 500 in FIG. 5A includes the same hardware layer 502 as the hardware layer 402 shown in FIG. 4. However, rather than providing an operating system layer directly above the hardware layer, as in FIG. 4, the virtualized computing environment shown in FIG. 5A features a virtualization layer 504 that interfaces through a virtualization-layer/hardware-layer interface 506, equivalent to interface 416 in FIG. 4, to the hardware. The virtualization layer 504 provides a hardware-like interface to VMs, such as VM 510, in a virtual-machine layer 511 executing above the virtualization layer 504. Each VM includes one or more application programs or other higher-level computational entities packaged together with an operating system, referred to as a “guest operating system,” such as application 514 and guest operating system 516 packaged together within VM 510. Each VM is thus equivalent to the operating-system layer 404 and application-program layer 406 in the general-purpose computer system shown in FIG. 4. Each guest operating system within a VM interfaces to the virtualization layer interface 504 rather than to the actual hardware interface 506. The virtualization layer 504 partitions hardware devices into abstract virtual-hardware layers to which each guest operating system within a VM interfaces.
The guest operating systems within the VMs, in general, are unaware of the virtualization layer and operate as if they were directly accessing a true hardware interface. The virtualization layer 504 ensures that each of the VMs currently executing within the virtual environment receives a fair allocation of underlying hardware devices and that all VMs receive sufficient devices to progress in execution. The virtualization layer 504 may differ for different guest operating systems. For example, the virtualization layer is generally able to provide virtual hardware interfaces for a variety of different types of computer hardware. This allows, as one example, a VM that includes a guest operating system designed for a particular computer architecture to run on hardware of a different architecture. The number of VMs need not be equal to the number of physical processors or even a multiple of the number of processors.

The virtualization layer 504 includes a virtual-machine-monitor module 518 (“VMM”) that virtualizes physical processors in the hardware layer to create virtual processors on which each of the VMs executes. For execution efficiency, the virtualization layer attempts to allow VMs to directly execute non-privileged instructions and to directly access non-privileged registers and memory. However, when the guest operating system within a VM accesses virtual privileged instructions, virtual privileged registers, and virtual privileged memory through the virtualization layer 504, the accesses result in execution of virtualization-layer code to simulate or emulate the privileged devices. The virtualization layer additionally includes a kernel module 520 that manages memory, communications, and data-storage machine devices on behalf of executing VMs (“VM kernel”). The VM kernel, for example, maintains shadow page tables on each VM so that hardware-level virtual-memory facilities can be used to process memory accesses. The VM kernel additionally includes routines that implement virtual communications and data storage devices as well as device drivers that directly control the operation of underlying hardware communications and data storage devices. Similarly, the VM kernel virtualizes various other types of I/O devices, including keyboards, optical-disk drives, and other such devices. The virtualization layer 504 essentially schedules execution of VMs much like an operating system schedules execution of application programs, so that the VMs each execute within a complete and fully functional virtual hardware layer.

FIG. 5B shows a second type of virtualization. In FIG. 5B, the computer system 540 includes the same hardware layer 542 and operating system layer 544 as the hardware layer 402 and the operating system layer 404 shown in FIG. 4. Several application programs 546 and 548 are shown running in the execution environment provided by the operating system 544. In addition, a virtualization layer 550 is also provided, in computer 540, but, unlike the virtualization layer 504 discussed with reference to FIG. 5A, virtualization layer 550 is layered above the operating system 544, referred to as the “host OS,” and uses the operating system interface to access operating-system-provided functionality as well as the hardware. The virtualization layer 550 comprises primarily a VMM and a hardware-like interface 552, similar to hardware-like interface 508 in FIG. 5A. The hardware-layer interface 552, equivalent to interface 416 in FIG. 4, provides an execution environment for a number of VMs 556-558, each including one or more application programs or other higher-level computational entities packaged together with a guest operating system.

In FIGS. 5A-5B, the layers are somewhat simplified for clarity of illustration. For example, portions of the virtualization layer 550 may reside within the host-operating-system kernel, such as a specialized driver incorporated into the host operating system to facilitate hardware access by the virtualization layer.

It should be noted that virtual hardware layers, virtualization layers, and guest operating systems are all physical entities that are implemented by computer instructions stored in physical data storage devices, including electronic memories, mass-storage devices, optical disks, magnetic disks, and other such devices. The term “virtual” does not, in any way, imply that virtual hardware layers, virtualization layers, and guest operating systems are abstract or intangible. Virtual hardware layers, virtualization layers, and guest operating systems execute on physical processors of physical computer systems and control operation of the physical computer systems, including operations that alter the physical states of physical devices, including electronic memories and mass-storage devices. They are as physical and tangible as any other component of a computer, such as power supplies, controllers, processors, busses, and data storage devices.

A VM or virtual application, described below, is encapsulated within a data package for transmission, distribution, and loading into a virtual-execution environment. One public standard for virtual-machine encapsulation is referred to as the “open virtualization format” (“OVF”). The OVF standard specifies a format for digitally encoding a VM within one or more data files. FIG. 6 shows an OVF package. An OVF package 602 includes an OVF descriptor 604, an OVF manifest 606, an OVF certificate 608, one or more disk-image files 610-611, and one or more device files 612-614. The OVF package can be encoded and stored as a single file or as a set of files. The OVF descriptor 604 is an XML document 620 that includes a hierarchical set of elements, each demarcated by a beginning tag and an ending tag. The outermost, or highest-level, element is the envelope element, demarcated by tags 622 and 623. The next-level element includes a reference element 626 that includes references to all files that are part of the OVF package, a disk section 628 that contains meta information about all of the virtual disks included in the OVF package, a network section 630 that includes meta information about all of the logical networks included in the OVF package, and a collection of virtual-machine configurations 632 which further includes hardware descriptions of each VM 634. There are many additional hierarchical levels and elements within a typical OVF descriptor. The OVF descriptor is thus a self-describing, XML file that describes the contents of an OVF package. The OVF manifest 606 is a list of cryptographic-hash-function-generated digests 636 of the entire OVF package and of the various components of the OVF package. The OVF certificate 608 is an authentication certificate 640 that includes a digest of the manifest and that is cryptographically signed.
Disk image files, such as disk image file 610, are digital encodings of the contents of virtual disks and device files 612 are digitally encoded content, such as operating-system images. A VM or a collection of VMs encapsulated together within a virtual application can thus be digitally encoded as one or more files within an OVF package that can be transmitted, distributed, and loaded using well-known tools for transmitting, distributing, and loading files. A virtual appliance is a software service that is delivered as a complete software stack installed within one or more VMs that is encoded within an OVF package.

The advent of VMs and virtual environments has alleviated many of the difficulties and challenges associated with traditional general-purpose computing. Machine and operating-system dependencies can be significantly reduced or eliminated by packaging applications and operating systems together as VMs and virtual appliances that execute within virtual environments provided by virtualization layers running on many different types of computer hardware. A next level of abstraction, referred to as virtual data centers or virtual infrastructure, provides a data-center interface to virtual data centers computationally constructed within physical data centers.

FIG. 7 shows virtual data centers provided as an abstraction of underlying physical-data-center hardware components. In FIG. 7, a physical data center 702 is shown below a virtual-interface plane 704. The physical data center consists of a virtual-data-center management server computer 706 and any of various different computers, such as PC 708, on which a virtual-data-center management interface may be displayed to system administrators and other users. The physical data center additionally includes generally large numbers of server computers, such as server computer 710, that are coupled together by local area networks, such as local area network 712 that directly interconnects server computers 710 and 714-720 and a mass-storage array 722. The physical data center shown in FIG. 7 includes three local area networks 712, 724, and 726 that each directly interconnects a bank of eight server computers and a mass-storage array. The individual server computers, such as server computer 710, each includes a virtualization layer and runs multiple VMs. Different physical data centers may include many different types of computers, networks, data-storage systems, and devices connected according to many different types of connection topologies. The virtual-interface plane 704, a logical abstraction layer shown by a plane in FIG. 7, abstracts the physical data center to a virtual data center comprising one or more device pools, such as device pools 730-732, one or more virtual data stores, such as virtual data stores 734-736, and one or more virtual networks. In certain implementations, the device pools abstract banks of server computers directly interconnected by a local area network.

The virtual-data-center management interface allows provisioning and launching of VMs with respect to device pools, virtual data stores, and virtual networks, so that virtual-data-center administrators need not be concerned with the identities of physical-data-center components used to execute particular VMs. Furthermore, the virtual-data-center management server computer 706 includes functionality to migrate running VMs from one server computer to another in order to optimally or near optimally manage device allocation and to provide fault tolerance and high availability by migrating VMs to most effectively utilize underlying physical hardware devices, to replace VMs disabled by physical hardware problems and failures, and to ensure that multiple VMs supporting a high-availability virtual appliance are executing on multiple physical computer systems so that the services provided by the virtual appliance are continuously accessible, even when one of the multiple virtual appliances becomes compute bound, data-access bound, suspends execution, or fails. Thus, the virtual data center layer of abstraction provides a virtual-data-center abstraction of physical data centers to simplify provisioning, launching, and maintenance of VMs and virtual appliances as well as to provide high-level, distributed functionalities that involve pooling the devices of individual server computers and migrating VMs among server computers to achieve load balancing, fault tolerance, and high availability.

FIG. 8 shows virtual-machine components of a virtual-data-center management server computer and physical server computers of a physical data center above which a virtual-data-center interface is provided by the virtual-data-center management server computer. The virtual-data-center management server computer 802 and a virtual-data-center database 804 comprise the physical components of the management component of the virtual data center. The virtual-data-center management server computer 802 includes a hardware layer 806 and virtualization layer 808 and runs a virtual-data-center management-server VM 810 above the virtualization layer. Although shown as a single server computer in FIG. 8, the virtual-data-center management server computer (“VDC management server”) may include two or more physical server computers that support multiple VDC-management-server virtual appliances. The virtual-data-center management-server VM 810 includes a management-interface component 812, distributed services 814, core services 816, and a host-management interface 818. The host-management interface 818 is accessed from any of various computers, such as the PC 708 shown in FIG. 7. The host-management interface 818 allows the virtual-data-center administrator to configure a virtual data center, provision VMs, collect statistics and view log files for the virtual data center, and carry out other, similar management tasks. The host-management interface 818 interfaces to virtual-data-center agents 824, 825, and 826 that execute as VMs within each of the server computers of the physical data center that is abstracted to a virtual data center by the VDC management server computer.

The distributed services 814 include a distributed-device scheduler that assigns VMs to execute within particular physical server computers and that migrates VMs in order to most effectively make use of computational bandwidths, data-storage capacities, and network capacities of the physical data center. The distributed services 814 further include a high-availability service that replicates and migrates VMs in order to ensure that VMs continue to execute despite problems and failures experienced by physical hardware components. The distributed services 814 also include a live-virtual-machine migration service that temporarily halts execution of a VM, encapsulates the VM in an OVF package, transmits the OVF package to a different physical server computer, and restarts the VM on the different physical server computer from a virtual-machine state recorded when execution of the VM was halted. The distributed services 814 also include a distributed backup service that provides centralized virtual-machine backup and restore.

The core services 816 provided by the VDC management server VM 810 include host configuration, virtual-machine configuration, virtual-machine provisioning, generation of virtual-data-center alerts and events, ongoing event logging and statistics collection, a task scheduler, and a device-management module. Each of the physical server computers 820-822 also includes a host-agent VM 828-830 through which the virtualization layer can be accessed via a virtual-infrastructure application programming interface (“API”). This interface allows a remote administrator or user to manage an individual server computer through the infrastructure API. The virtual-data-center agents 824-826 access virtualization-layer server information through the host agents. The virtual-data-center agents are primarily responsible for offloading certain of the virtual-data-center management-server functions specific to a particular physical server to that physical server computer. The virtual-data-center agents relay and enforce device allocations made by the VDC management server VM 810, relay virtual-machine provisioning and configuration-change commands to host agents, monitor and collect performance statistics, alerts, and events communicated to the virtual-data-center agents by the local host agents through the interface API, and carry out other, similar virtual-data-management tasks.

The virtual-data-center abstraction provides a convenient and efficient level of abstraction for exposing the computational devices of a cloud-computing facility to cloud-computing-infrastructure users. A cloud-director management server exposes virtual devices of a cloud-computing facility to cloud-computing-infrastructure users. In addition, the cloud director introduces a multi-tenancy layer of abstraction, which partitions VDCs into tenant-associated VDCs that can each be allocated to an individual tenant or tenant organization, both referred to as a “tenant.” A given tenant can be provided one or more tenant-associated VDCs by a cloud director managing the multi-tenancy layer of abstraction within a cloud-computing facility. The cloud services interface (308 in FIG. 3) exposes a virtual-data-center management interface that abstracts the physical data center.

FIG. 9 shows a cloud-director level of abstraction. In FIG. 9, three different physical data centers 902-904 are shown below planes representing the cloud-director layer of abstraction 906-908. Above the planes representing the cloud-director level of abstraction, multi-tenant virtual data centers 910-912 are shown. The devices of these multi-tenant virtual data centers are securely partitioned in order to provide secure virtual data centers to multiple tenants, or cloud-services-accessing organizations. For example, a cloud-services-provider virtual data center 910 is partitioned into four different tenant-associated virtual-data centers within a multi-tenant virtual data center for four different tenants 916-919. Each multi-tenant virtual data center is managed by a cloud director comprising one or more cloud-director server computers 920-922 and associated cloud-director databases 924-926. Each cloud-director server computer or server computers runs a cloud-director virtual appliance 930 that includes a cloud-director management interface 932, a set of cloud-director services 934, and a virtual-data-center management-server interface 936. The cloud-director services include an interface and tools for provisioning tenant-associated virtual data centers within the multi-tenant virtual data center on behalf of tenants, tools and interfaces for configuring and managing tenant organizations, tools and services for organization of virtual data centers and tenant-associated virtual data centers within the multi-tenant virtual data center, services associated with template and media catalogs, and provisioning of virtualization networks from a network pool. Templates are VMs that each contains an OS and/or one or more VMs containing applications. A template may include much of the detailed contents of VMs and virtual appliances that are encoded within OVF packages, so that the task of configuring a VM or virtual appliance is significantly simplified, requiring only deployment of one OVF package. These templates are stored in catalogs within a tenant's virtual-data center. These catalogs are used for developing and staging new virtual appliances, and published catalogs are used for sharing templates in virtual appliances across organizations. Catalogs may include OS images and other information relevant to construction, distribution, and provisioning of virtual appliances.

Considering FIGS. 7 and 9, the VDC-server and cloud-director layers of abstraction can be seen, as discussed above, to facilitate employment of the virtual-data-center concept within private and public clouds. However, this level of abstraction does not fully facilitate aggregation of single-tenant and multi-tenant virtual data centers into heterogeneous or homogeneous aggregations of cloud-computing facilities.

FIG. 10 shows virtual-cloud-connector nodes (“VCC nodes”) and a VCC server, components of a distributed system that provides multi-cloud aggregation and that includes a cloud-connector server and cloud-connector nodes that cooperate to provide services that are distributed across multiple clouds. VMware vCloud™ VCC servers and nodes are one example of a VCC server and nodes. In FIG. 10, seven different cloud-computing facilities 1002-1008 are shown. Cloud-computing facility 1002 is a private multi-tenant cloud with a cloud director 1010 that interfaces to a VDC management server 1012 to provide a multi-tenant private cloud comprising multiple tenant-associated virtual data centers. The remaining cloud-computing facilities 1003-1008 may be either public or private cloud-computing facilities and may be single-tenant virtual data centers, such as virtual data centers 1003 and 1006, multi-tenant virtual data centers, such as multi-tenant virtual data centers 1004 and 1007-1008, or any of various different kinds of third-party cloud-services facilities, such as third-party cloud-services facility 1005. An additional component, the VCC server 1014, acting as a controller, is included in the private cloud-computing facility 1002 and interfaces to a VCC node 1016 that runs as a virtual appliance within the cloud director 1010. A VCC server may also run as a virtual appliance within a VDC management server that manages a single-tenant private cloud. The VCC server 1014 additionally interfaces, through the Internet, to VCC node virtual appliances executing within remote VDC management servers, remote cloud directors, or within the third-party cloud services 1018-1023. The VCC server provides a VCC server interface that can be displayed on a local or remote terminal, PC, or other computer system 1026 to allow a cloud-aggregation administrator or other user to access VCC-server-provided aggregate-cloud distributed services. In general, the cloud-computing facilities that together form a multiple-cloud-computing aggregation through distributed services provided by the VCC server and VCC nodes are geographically and operationally distinct.

As mentioned above, while the virtual-machine-based virtualization layers, described in the previous subsection, have received widespread adoption and use in a variety of different environments, from personal computers to enormous, distributed computing systems, traditional virtualization technologies are associated with computational overheads. While these computational overheads have steadily decreased, over the years, and often represent ten percent or less of the total computational bandwidth consumed by an application running above a guest operating system in a virtualized environment, traditional virtualization technologies nonetheless involve computational costs in return for the power and flexibility that they provide.

While a traditional virtualization layer can simulate the hardware interface expected by any of many different operating systems, OSL virtualization essentially provides a secure partition of the execution environment provided by a particular operating system. As one example, OSL virtualization provides a file system to each container, but the file system provided to the container is essentially a view of a partition of the general file system provided by the underlying operating system of the host. In essence, OSL virtualization uses operating-system features, such as namespace isolation, to isolate each container from the other containers running on the same host. In other words, namespace isolation ensures that each application executing within the execution environment provided by a container is isolated from applications executing within the execution environments provided by the other containers. A container cannot access files that are not included in the container's namespace and cannot interact with applications running in other containers. As a result, a container can be booted up much faster than a VM, because the container uses operating-system-kernel features that are already available and functioning within the host. Furthermore, the containers share computational bandwidth, memory, network bandwidth, and other computational resources provided by the operating system, without the overhead associated with computational resources allocated to VMs and virtualization layers. Again, however, OSL virtualization does not provide many desirable features of traditional virtualization. As mentioned above, OSL virtualization does not provide a way to run different types of operating systems for different groups of containers within the same host, and OSL virtualization does not provide for live migration of containers between hosts, high-availability functionality, distributed resource scheduling, and other computational functionality provided by traditional virtualization technologies.

FIG. 11 shows an example server computer used to host three containers. As discussed above with reference to FIG. 4, an operating system layer 404 runs above the hardware 402 of the host computer. The operating system provides an interface, for higher-level computational entities, that includes a system-call interface 428 and the non-privileged instructions, memory addresses, and registers 426 provided by the hardware layer 402. However, unlike in FIG. 4, in which applications run directly above the operating system layer 404, OSL virtualization involves an OSL virtualization layer 1102 that provides operating-system interfaces 1104-1106 to each of the containers 1108-1110. Each container, in turn, provides an execution environment for an application, such as the application that runs within the execution environment provided by container 1108. A container can be thought of as a partition of the resources generally available to higher-level computational entities through the operating system interface 430.

FIG. 12 shows an approach to implementing the containers on a VM. FIG. 12 shows a host computer similar to that shown in FIG. 5A, discussed above. The host computer includes a hardware layer 502 and a virtualization layer 504 that provides a virtual hardware interface 508 to a guest operating system 1102. Unlike in FIG. 5A, the guest operating system interfaces to an OSL-virtualization layer 1104 that provides container execution environments 1200-1208 to multiple application programs.

Note that, although only a single guest operating system and OSL virtualization layer are shown in FIG. 12, a single virtualized host system can run multiple different guest operating systems within multiple VMs, each of which supports one or more OSL-virtualization containers. A virtualized, distributed computing system that uses guest operating systems running within VMs to support OSL-virtualization layers to provide containers for running applications is referred to, in the following discussion, as a “hybrid virtualized distributed computing system.”

Running containers above a guest operating system within a VM provides advantages of traditional virtualization in addition to the advantages of OSL virtualization. Containers can be quickly booted in order to provide additional execution environments and associated resources for additional application instances. The resources available to the guest operating system are efficiently partitioned among the containers provided by the OSL-virtualization layer 1204 in FIG. 12, because there is almost no additional computational overhead associated with container-based partitioning of computational resources. At the same time, many of the powerful and flexible features of the traditional virtualization technology can be applied to VMs in which containers run above guest operating systems, including live migration from one host to another, various types of high-availability and distributed resource scheduling, and other such features. Containers provide share-based allocation of computational resources to groups of applications with guaranteed isolation of applications in one container from applications in the remaining containers executing above a guest operating system. Moreover, resource allocation can be modified at run time between containers. The traditional virtualization layer provides for flexible scaling over large numbers of hosts within distributed computing systems and a simple approach to operating-system upgrades and patches. Thus, the use of OSL virtualization above traditional virtualization in a hybrid virtualized distributed computing system, as shown in FIG. 12, provides many of the advantages of both a traditional virtualization layer and OSL virtualization.

Computer-Implemented Methods and Systems for Performing Intelligent Sampling of Normal and Erroneous Application Traces

A distributed application comprises multiple VMs or containers that run application components simultaneously on one or more host server computers of a distributed computing system. The components are typically executed separately in the VMs or containers. The server computers are networked together so that information processing performed by the distributed application is distributed over the server computers, allowing the VMs or containers to exchange data. The distributed application can be scaled to satisfy changing demands by increasing or decreasing the number of VMs or containers. As a result, a typical distributed application can process multiple requests from multiple clients at the same time.

FIG. 13 shows an example of a virtualization layer 1302 that is executed in a physical data center 1304. For the sake of illustration, the virtualization layer 1302 is shown separated from the physical data center 1304 by a virtual-interface plane 1306. The physical data center 1304 is an example of a distributed computing system. The physical data center 1304 comprises physical objects, including an administration computer system 1308, any of various computers, such as PC 1310, on which a virtual data center (“VDC”) management interface may be displayed to system administrators and other users, computers, such as computers 1312-1319, data storage devices, and network devices. Each computer may have multiple network interface cards (“NICs”) that provide high bandwidth and networking to other computers and data storage devices in the physical data center 1304. The computers may be mounted in racks (not shown) that are networked together to form server-computer groups within the data center 1304. The example physical data center 1304 includes three computer groups, each of which has eight computers. For example, computer group 1320 comprises interconnected computers 1312-1319 that are connected to a mass-storage array 1322 via a switch (not shown). Within each computer group, certain computers are grouped together to form clusters. Each cluster provides an aggregated set of resources, such as processors, memory, and disk space (i.e., a resource pool), to objects in the virtualization layer 1302. Physical data centers are not limited to the example physical data center 1304. Different physical data centers may include many different types of computers, networks, data-storage systems, and devices connected according to many different types of connection topologies.

The virtualization layer 1302 includes virtual objects, such as VMs, applications, and containers, hosted by the computers in the physical data center 1304. The virtualization layer 1302 also includes a virtual network (not illustrated) comprising virtual switches, virtual routers, load balancers, and virtual NICs. Certain computers host VMs and containers as described above. For example, computer 1318 hosts two containers identified as Cont₁ and Cont₂; a cluster of computers 1313 and 1314 hosts five VMs identified as VM₁, VM₂, VM₃, VM₄, and VM₅; and computer 1324 hosts four VMs identified as VM₇, VM₈, VM₉, and VM₁₀. Other computers may host applications as described above with reference to FIG. 4. For example, computer 1326 hosts a standalone application identified as App₁.

In FIG. 13, the VMs VM₁, VM₂, VM₃, VM₄, and VM₅ are application components of a distributed application executed on the cluster of server computers 1313 and 1314. The resources of the server computers 1313 and 1314 provide a resource pool for the five VMs. The VMs enable different software components of the distributed application to run on different operating systems, share the same pool of resources, and share data. The VMs VM₁, VM₂, VM₃, VM₄, and VM₅ may provide web services to customers. For example, VM₁ may provide frontend services that enable users to purchase items sold by an owner of the distributed application over the Internet. VMs VM₂-VM₅ execute backend operations that complete each user's purchase, such as collecting money from a user's bank, charging a user's credit card, updating a user's information, updating the owner's inventory, and arranging for products to be shipped to the user. The VMs VM₇, VM₈, VM₉, and VM₁₀ execute a second distributed application on the server computer 1324. Containers Cont₁ and Cont₂ execute components of a third distributed application on the server computer 1318.

Application tracing tracks an application's flow and data progression, with the results for each execution of the application presented in a separate application trace. An application trace, also called a “trace,” represents a workflow executed by an application or a distributed application. A trace represents how a request, such as a user or client request, propagates through components of a distributed application or through services provided by each component of a distributed application. A trace consists of one or more spans. Each span represents an amount of time spent executing a service or performing a function of the application. Application traces may be used in troubleshooting to identify interesting patterns or performance problems with the application itself, the resources used to execute the application, and the network.
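
As an illustration only, the relationship between a trace and its spans can be sketched as a simple data structure. The class names, field names, and time unit below are assumptions made for this sketch and are not part of the described system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One span: the time spent executing a service or function."""
    service: str   # name of the service or function (illustrative)
    start: float   # start time in milliseconds (assumed unit)
    end: float     # end time in milliseconds (assumed unit)

    @property
    def duration(self) -> float:
        # Amount of time spent in this span.
        return self.end - self.start


@dataclass
class Trace:
    """One trace: the spans recorded for a single request."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)


# Hypothetical trace with two spans.
trace = Trace("trace-0001", [
    Span("Service1", 0.0, 10.0),
    Span("Service2", 2.0, 7.0),
])
span_durations = [span.duration for span in trace.spans]
```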

FIGS. 14A-14B show an example of a distributed application and an example application trace. FIG. 14A shows an example of five services provided by a distributed application. The services are represented by blocks identified as Service₁, Service₂, Service₃, Service₄, and Service₅. The services may be web services provided to customers. For example, Service₁ may be a web server that enables a user to purchase items sold by the application owner and communicates with other services. The services Service₂, Service₃, Service₄, and Service₅ are computational services that perform different functions to complete the user's request. The components perform different functions of a distributed application and are executed in separate VMs on one or more server computers or using shared resources of a resource pool provided by a cluster of server computers. For example, services Service₁, Service₂, Service₃, Service₄, and Service₅ are performed by five application components in VMs VM₁, VM₂, VM₃, VM₄, and VM₅, respectively, of FIG. 13. Directional arrows 1401-1405 represent requests for a service provided by the services Service₁, Service₂, Service₃, Service₄, and Service₅. For example, directional arrow 1401 represents a user's request for a service offered by Service₁, such as a functionality provided by a web server. After a request has been issued by the user, directional arrows 1403 and 1404 represent Service₁ requests for execution of services or functions performed by Service₂ and Service₃. Dashed directional arrows 1406 and 1407 represent responses. For example, Service₂ sends a response to Service₁ indicating that the operations performed by Service₃ and Service₄ have been completed. Service₁ then requests services from Service₅, as represented by directional arrow 1405, and provides a response to the user, as represented by directional arrow 1407.

FIG. 14B shows an example trace of the distributed application represented in FIG. 14A. Directional arrow 1408 is a time axis. The order in which services are executed is listed in column 1409. The services perform different functions indicated in parentheses, with services Service₁ and Service₂ performing more than one function. Each bar represents a time span, which is an amount of time (i.e., duration) spent performing one of the functions provided by a service. Unshaded bars 1410-1412 represent spans of time spent executing the different functions performed by Service₁. For example, bar 1410 represents the span of time Service₁ spends interacting with a user. Bar 1411 represents the span of time Service₁ spends interacting with the services provided by Service₂. Hash-marked bars 1414-1415 represent spans of time spent executing Service₂ with services Service₃ and Service₄. Shaded bar 1416 represents a span of time spent executing Service₃. Dark hash-marked bar 1418 represents a span of time spent executing Service₄. Cross-hatched bar 1420 represents a span of time spent executing Service₅.

Traces are classified according to trace type, which is given by the span of the first service, operation, or function performed by an application. The first span is called the “root span,” which is used as the trace type and is denoted by TT. For example, the span 1410 of the trace shown in FIG. 14B is the root span of the trace and is used to identify the trace type. A trace may also be classified by the order in which different services or functions are performed by an application. For example, the ordered sequence of services or functions listed in column 1409 may be used to define a trace type denoted by a 7-tuple: (Service₁, Service₁, Service₂, Service₃, Service₄, Service₁, Service₅). Each trace has a corresponding duration or total time of the trace denoted by D. The duration is the amount of time taken by the application to complete a request or perform a series of functions requested by a client. For example, time interval 1422 is the duration D of the trace shown in FIG. 14B, which represents the total amount of time taken to execute the services in FIG. 14A.
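
The two ways of classifying a trace and the trace duration D described above can be sketched as follows. This is a minimal illustration, assuming each span is recorded as a (service, start, end) tuple; the function names and the timing values are hypothetical and were chosen only so that the ordered sequence matches the 7-tuple given above.

```python
from typing import List, Tuple

SpanRecord = Tuple[str, float, float]  # (service, start time, end time)


def root_span(spans: List[SpanRecord]) -> SpanRecord:
    """Return the first span to start; it serves as the trace type TT."""
    return min(spans, key=lambda s: s[1])


def sequence_trace_type(spans: List[SpanRecord]) -> Tuple[str, ...]:
    """Return the services ordered by span start time."""
    return tuple(s[0] for s in sorted(spans, key=lambda s: s[1]))


def trace_duration(spans: List[SpanRecord]) -> float:
    """Return D: latest span end minus earliest span start."""
    return max(s[2] for s in spans) - min(s[1] for s in spans)


# Invented timings (arbitrary units) for a trace shaped like FIG. 14B.
spans = [
    ("Service1", 0, 4), ("Service1", 4, 20), ("Service2", 5, 18),
    ("Service3", 6, 10), ("Service4", 10, 16), ("Service1", 20, 26),
    ("Service5", 21, 25),
]
tt = root_span(spans)
sequence = sequence_trace_type(spans)
d = trace_duration(spans)
```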

Modern distributed applications generate enormous numbers of traces each day. For example, a shopping website may be accessed and used hundreds of thousands of times each day, resulting in storage of hundreds of thousands of corresponding traces in a data storage device. Many of the traces may be nearly identical and correspond to nearly identical operations performed by an application. Traces that correspond to normal operations performed by an application are identified as normal traces. On the other hand, erroneous traces that are used to troubleshoot performance of the application and identify a root cause of a problem with the application are often produced with a much lower frequency than normal traces.

FIGS. 15A-15B show examples of erroneous traces for the distributed application represented in FIG. 14A. In FIG. 15A, dashed-line bars 1501-1504 represent normal spans for services provided by Service₁, Service₂, Service₄, and Service₅ as represented by spans 1515, 1518, 1512, and 1520 in FIG. 14B. Spans 1506 and 1508 represent shortened spans for Service₂ and Service₄. No spans are present for Service₁ and Service₅, as indicated by dashed bars 1503 and 1504, indicating that the application components that perform Service₁ and Service₅ failed to execute. The trace illustrated in FIG. 15A is identified as an erroneous trace. In FIG. 15B, a latency pushes the spans 1512 and 1520 associated with executing corresponding Service₁ and Service₅ to later times.

Application traces may be assigned status codes that indicate whether execution of a particular operation or response to a client request by a corresponding application is a success or a failure. In one implementation, erroneous traces may be identified by corresponding HTTP (“hyper-text transfer protocol”) status codes. HTTP is a protocol used to transfer data over the World Wide Web and is part of an Internet protocol suite that defines commands and services used for transmitting webpage data. In one implementation, traces are assigned, or tagged with, HTTP status codes that indicate the status of specific HTTP requests associated with execution of an application. In particular, traces tagged with HTTP client error status codes 4XX (i.e., the request contains bad syntax or cannot be fulfilled) and server error status codes 5XX (i.e., the server failed to fulfil an apparently valid request), where each X represents a digit, are erroneous traces. For example, when data has been successfully transmitted by the application, or application components, to a client or vice versa, the corresponding trace may be tagged with the HTTP status code 200, indicating a success, and the trace is identified as a normal trace. On the other hand, when data has not been successfully transmitted between the application and a client or between application components, the corresponding trace is an erroneous trace that is tagged with, for example, the HTTP status code 400, indicating a failed or bad request.
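
The status-code rule described above can be illustrated with a short check. The function name is hypothetical; the sketch simply treats any 4XX or 5XX HTTP status code as marking an erroneous trace and every other code as marking a normal trace.

```python
def is_erroneous_status(status_code: int) -> bool:
    """Return True for HTTP client (4XX) and server (5XX) error codes."""
    return 400 <= status_code <= 599


# Hypothetical status codes tagged on four traces.
tagged = {code: is_erroneous_status(code) for code in (200, 302, 400, 503)}
```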

In another implementation, when hardware and/or a network used by an application experiences particular failures, user-defined status codes may be used to tag corresponding traces as erroneous traces. For example, if CPU usage or memory usage spikes or drops below a threshold while an application is executing, a corresponding trace may be tagged as erroneous. In another example, when data packets are dropped by one or more VMs executing application components of an application, a corresponding trace may be tagged as an erroneous trace.

In another implementation, user-defined status codes may be used to tag spans of traces. An erroneous trace contains one or more spans that have been tagged with an error. For example, spans 1506 and 1508 in FIG. 15A are tagged with error_tag=TRUE. As a result, the trace represented by FIG. 15A is tagged as an erroneous trace. In another example, the latency shifted spans 1512 and 1520 in FIG. 15B are tagged with error_tag=TRUE. As a result, the trace represented by FIG. 15B is tagged as an erroneous trace. Alternatively, when a distributed application performs without issue, as described above with reference to the example in FIG. 14B, the corresponding spans may be tagged with error_tag=FALSE and the trace is tagged as a normal trace.
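The span-level rule above, in which a trace is erroneous when at least one of its spans carries an error tag, may be sketched as follows. Representing each span as a dictionary with an `error_tag` key is an assumption made for illustration only.

```python
def trace_is_erroneous(spans) -> bool:
    """Return True when any span of the trace is tagged with an error.

    Each span is assumed to be a dict; a missing "error_tag" key is
    treated as error_tag=FALSE in this sketch.
    """
    return any(span.get("error_tag", False) for span in spans)
```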

Erroneous traces of an application tend to have shorter or longer durations than the typical trace duration. For example, during typical execution of an application a corresponding trace has a duration, D, that falls between lower and upper limits denoted by D_(l)<D<D_(u), where D_(l) is a lower time limit and D_(u) is an upper time limit. When D≤D_(l) or D_(u)≤D, performance of the application is abnormal, and the corresponding trace is identified as an erroneous trace. The upper and lower time limits may be the upper and lower thresholds of a histogram constructed as described below with reference to Equations (6a) and (6b) under histogram creation.

In recent years, application management tools have been developed to apply different sampling procedures that reduce the amount of storage dedicated to storing traces. The sampling procedures include rate-based sampling and duration-based sampling. Rate-based sampling, also called “probabilistic sampling,” stores a fixed percentage of the generated traces. Duration-based sampling stores traces with durations that are greater than a predefined threshold. However, these conventional sampling procedures fail to distinguish the different trace types and durations during sampling, which leads to information distortion. Information distortion occurs when infrequently occurring traces are not included in the sampled traces. For example, conventional trace sampling procedures fail to consider the frequencies of different trace types and trace durations. Erroneous traces are often infrequently generated and contain information that is useful in troubleshooting a performance problem with an application. Because conventional sampling procedures do not make a distinction between high and low frequency generated trace types and trace durations, there is a risk that sampled traces obtained using conventional sampling procedures will not contain any erroneous traces, or will not contain a sufficient representation of erroneous traces, resulting in a loss of potentially important information needed in troubleshooting performance of an application. As a result, troubleshooting performance problems without a sufficient representation of erroneous traces leads to an inaccurate representation of a performance problem and misleads troubleshooting algorithms and system administrators in detecting the root cause of the performance problem. One approach is to use error-based sampling, which stores only erroneous traces. However, in certain situations the number of erroneous traces far exceeds the number of normal application traces, which eventually leads to the same problem of not having enough storage space available for normal and erroneous traces.

Computer-implemented methods and systems described below perform intelligent sampling of normal and erroneous traces. The traces are generated for an application. The sampling and compression described below may be performed in real time on a stream of traces or performed on traces read from a trace database. Computer-implemented intelligent sampling described below stores enough normal and erroneous traces across the different trace types and different durations regardless of frequency to enable accurate troubleshooting of performance problems without information distortion created by conventional sampling procedures. In particular, computer-implemented intelligent sampling methods and systems described below generate different sampling rates for normal and erroneous traces. The sampling rates for low frequency normal and erroneous traces are larger than the sampling rates for higher frequency normal and erroneous traces, which ensures that low frequency traces are sampled in sufficient numbers and are not passed over during sampling. Troubleshooting and root cause analysis are applied to the sampled erroneous traces to identify the source of performance problems with the application and the application components. Computer-implemented methods and systems may then employ remedial measures to correct the performance problems. For example, VMs or containers executing application components may be migrated to different hosts to increase performance. Additional VMs or containers may be started to alleviate the workloads on already existing VMs and containers. Network bandwidth may be increased to reduce latency between peer VMs.

FIG. 16A shows an example graphical-user interface (“GUI”) 1600 that enables a user to select an application and input sampling rates for sampling normal and erroneous traces of the selected application. A user, such as a system administrator or application owner, selects an application from a list of applications provided in window 1602. For example, highlighted entry 1604 indicates a user has selected “Application 6” by clicking on the application name with the cursor. The example GUI 1600 includes two ways the user may input a sampling rate. The sampling rate is the percentage (i.e., fraction) of normal and erroneous traces that are to be sampled from a set of trace data and stored in a data storage device. When trace types and durations are known, a user may choose to sample by trace type and/or trace duration by clicking on button 1606. When sampling does not depend on trace type or trace durations, a user may choose to sample based on normal and erroneous status of the traces alone by clicking on button 1608. After clicking on button 1606 or button 1608, the user clicks on button 1609 and selects one of the preset sampling levels identified as “conservative” 1610, “aggressive” 1611, and “super aggressive” 1613. Conservative, aggressive, and super aggressive sampling rates correspond to different fractions of traces sampled from runtime traces or a database of traces for an application. For example, a conservative sampling rate is used to sample and store a larger number of traces than an aggressive sampling rate and an aggressive sampling rate is used to sample and store a larger number of traces than a super aggressive sampling rate. In this example, conservative, aggressive, and super aggressive sampling rates are preset and correspond to storing 15%, 10%, and 5% of the traces in the data storage device. Rather than using one of the preset sampling rates, the user may also choose to input a sampling level by clicking on button 1614 and entering a sampling level (i.e., sample rate) in field 1616. When a user would like to set an overall sampling rate of the set of trace data and set an erroneous trace sampling rate, instead of clicking on either button 1606 or 1608, the user clicks on button 1618 and enters an overall sampling rate in field 1620 and an erroneous trace sampling rate in field 1622. A user may choose to sample the set of trace data stored in a database of traces by clicking on button 1624 and entering a location of the database in field 1626. Alternatively, the user may choose to sample runtime traces of “Application 6” as the traces are generated by clicking on button 1628. When a user clicks on the “Execute sampling” button 1630, sampling is executed on the traces in accordance with the user's selections as described below.

Computer-implemented methods and systems for intelligent sampling of application traces described below are encoded in machine-readable instructions that are executed in a computer system, such as a server computer. FIG. 16B shows an example of a computer system 1632 that executes machine-readable instructions for sampling traces produced by an application 1634. The traces may be sent directly from the application to the computer system 1632 as indicated by directional arrow 1636. Alternatively, the traces may be stored in a trace database 1638, as indicated by directional arrow 1640, and the computer system 1632 reads the traces from the database 1638 as indicated by directional arrow 1642. The computer system 1632 applies the user-selected sampling rate to the traces as described below and stores sampled normal traces and erroneous traces in the data storage device 1644, thereby reducing the overall number of traces. Troubleshooting a performance problem with the application 1634 is performed on the sampled erroneous traces stored in the data storage device 1644. Troubleshooting is a systematic technique in which the erroneous traces are used to identify a performance problem with the application 1634 or a performance problem with the hardware or network of a distributed computing system used to run the application 1634. When a performance problem has been identified, computer-implemented methods and systems execute remedial measures to correct the problem.

Sampling Known Trace Types and Durations with Normal and Erroneous Traces

Computer-implemented methods described below perform three different processes for sampling normal and erroneous traces with known trace types and durations. One process performs trace-type sampling of normal and erroneous traces based on frequencies of trace types. A second process performs sampling of erroneous and normal traces based on durations of traces independent of the trace type. A third process performs a hybrid trace-type and duration sampling of normal and erroneous traces. Each process is described separately below.

Trace-Type Sampling of Known Trace Types with Normal and Erroneous Traces

FIG. 17 shows an example set of trace data 1702 generated by an application. The trace data 1702 may be stored in a trace database or sent directly to the computer system 1632 and stored in a buffer. Each row represents information associated with a trace. Each trace is assigned a trace identification (“ID”). Column 1704 is a list of trace IDs assigned to the traces in the trace data. Column 1706 is a list of durations of the traces. Column 1708 lists the trace type (i.e., root span). Columns 1710 list K different services or functions as described above with reference to FIG. 14B. Each entry in columns 1710 contains a span-tuple (span_name(k), t_(s), t_(e)), where span_name(k) is a span name, t_(s) is a start time of the span, and t_(e) is an end time of the span. Entries in column 1711 record the status codes of the traces. The status codes indicate whether a trace is a normal trace, denoted by “norm,” or an erroneous trace, denoted by “err.” In one implementation, the status codes may be simple binary norm and err designations. In another implementation, the status codes for erroneous traces may contain more information, such as HTTP status codes or user-defined status codes. For example, trace 1712 has a trace ID “ef167gp7,” a trace duration of 00:12.46 seconds, a trace type “Service₁,” which is the name of the root span of the trace, and an “err” status code indicating that the trace 1712 is an erroneous trace.

The traces recorded in a set of trace data are sorted into groups of traces with the same trace type independent of trace durations and status code. The number of traces in each group of traces is counted. The traces of each trace type are partitioned into normal traces and erroneous traces. For each trace type, a normal trace-type sampling rate is determined for the normal traces and an erroneous trace-type sampling rate is determined for the erroneous traces. Suppose a set of trace data contains N traces with M different trace types (i.e., M≤N). Let N_(m) be the number of traces with the m-th trace type, where index m=1, . . . , M. Let N_(n)^((m)) be the number of normal traces in the group of m-th trace types and N_(e)^((m)) be the number of erroneous traces in the group of m-th trace types, where N_(m)=N_(e)^((m))+N_(n)^((m)). A frequency of occurrence of normal traces of the m-th trace type is

$\begin{matrix}{p_{n}^{(m)} = \frac{N_{n}^{(m)}}{N_{m}}} & ( {1a} )\end{matrix}$

and a frequency of occurrence of erroneous traces of the m-th trace type is

$\begin{matrix}{p_{e}^{(m)} = \frac{N_{e}^{(m)}}{N_{m}}} & ( {1b} )\end{matrix}$

The normal trace-type sampling rate of each of the normal traces of the m-th trace type is

$\begin{matrix}{h_{n}^{(m)} = 1 - \left( p_{n}^{(m)} \right)^{\beta_{n}}} & (2a)\end{matrix}$

where 0≤β_(n) is called the “normal trace-type sampling parameter,”

and the erroneous trace-type sampling rate of each of the erroneous traces of the m-th trace type is

$\begin{matrix}{h_{e}^{(m)} = 1 - \left( p_{e}^{(m)} \right)^{\beta_{e}}} & (2b)\end{matrix}$

where 0≤β_(e) is called the “erroneous trace-type sampling parameter.”

The normal trace-type sampling rate decreases as the frequency of occurrence of normal traces with the m-th trace type increases. Similarly, the erroneous trace-type sampling rate decreases as the frequency of occurrence of erroneous traces with the m-th trace type increases. Each trace type has associated sampling rates represented by Equations (2a) and (2b). The normal trace-type sampling rate in Equation (2a) is the fraction of normal traces that belong to the m-th trace type and are sampled and stored in a data storage device. The erroneous trace-type sampling rate in Equation (2b) is the fraction of erroneous traces that belong to the m-th trace type and are sampled and stored in a data storage device.
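The computation of the frequencies of occurrence and the trace-type sampling rates may be sketched as follows. This is a minimal illustration under stated assumptions: traces are represented as (trace type, status code) pairs with status codes "norm" and "err", the function name `trace_type_sampling_rates` is hypothetical, and the erroneous rate is computed from the erroneous frequency of occurrence of Equation (1b).

```python
from collections import Counter


def trace_type_sampling_rates(traces, beta_n=1.0, beta_e=1.0):
    """Return {trace_type: {"norm": h_n, "err": h_e}} for an iterable of
    (trace_type, status) pairs, where status is "norm" or "err"."""
    totals = Counter()   # N_m: traces per trace type
    errors = Counter()   # N_e^(m): erroneous traces per trace type
    for trace_type, status in traces:
        totals[trace_type] += 1
        if status == "err":
            errors[trace_type] += 1

    rates = {}
    for trace_type, n_m in totals.items():
        p_e = errors[trace_type] / n_m            # frequency of erroneous traces
        p_n = (n_m - errors[trace_type]) / n_m    # frequency of normal traces
        rates[trace_type] = {
            "norm": 1.0 - p_n ** beta_n,          # normal sampling rate
            "err": 1.0 - p_e ** beta_e,           # erroneous sampling rate
        }
    return rates
```

For a trace type with 145 erroneous traces out of 1,000, this sketch gives an erroneous sampling rate of approximately 0.855 when the sampling parameter equals 1, because the rate is one minus the frequency of occurrence 0.145.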

Returning to FIG. 17, the traces of the set of trace data 1702 are sorted according to trace types to obtain sorted trace types 1714. For the sake of illustration, the durations and span information are omitted. The trace types are denoted by TT_(m), where m=1, . . . , M, and M is the number of different trace types in the trace data. Traces of the same trace type may be normal or erroneous traces. For example, normal trace 1716 is a TT₂ trace type with a normal status code “norm” 1718. By contrast, erroneous trace 1720 is also a TT₂ trace type with an erroneous status code “err” 1722.

The trace-type sampling parameters β_(n) and β_(e) correspond to the amount of normal and erroneous traces sampled and are based on user-selected sampling rates described below. Note that in one implementation β_(n)≠β_(e) and in another implementation β_(n)=β_(e). For example, “conservative” sampling corresponds to β=1, “aggressive” sampling corresponds to β=0.5, and “super aggressive” sampling corresponds to β=0.25, where β represents β_(n) and β_(e). The trace-type sampling parameters β_(n) and β_(e) are determined based on the user-selected sampling rate as described below.

The number of normal traces of the m-th trace type stored in the data storage device is given by

$\begin{matrix}{\overline{N}_{n}^{(m)} = N_{n}^{(m)} \times h_{n}^{(m)}} & (3a)\end{matrix}$

The number of erroneous traces of the m-th trace type stored in the data storage device is given by

$\begin{matrix}{\overline{N}_{e}^{(m)} = N_{e}^{(m)} \times h_{e}^{(m)}} & (3b)\end{matrix}$

The numbers of traces N̄_(n)^((m)) and N̄_(e)^((m)) are rounded to the nearest integer. The N̄_(n)^((m)) normal traces are randomly sampled from the N_(n)^((m)) normal traces and are stored in a data storage device as described below. The remaining unsampled normal traces of the m-th trace type (i.e., N_(n,rem)=N_(n)^((m))−N̄_(n)^((m))) are discarded by deleting the remaining normal traces from the data storage device or from a buffer where traces are temporarily stored during sampling. The N̄_(e)^((m)) erroneous traces are randomly sampled from the N_(e)^((m)) erroneous traces and are stored in a data storage device as described below. The remaining unsampled erroneous traces of the m-th trace type (i.e., N_(e,rem)=N_(e)^((m))−N̄_(e)^((m))) are discarded by deleting the remaining erroneous traces from the data storage device or from a buffer where traces are temporarily stored during sampling.
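The rounding and random sampling steps may be sketched as follows. This is an illustrative sketch only: the function name `sample_group` is hypothetical, and Python's built-in `round`, which rounds exact halves to the nearest even integer, stands in for rounding to the nearest integer.

```python
import random


def sample_group(traces, rate, rng=None):
    """Randomly keep round(len(traces) * rate) traces from one group.

    `traces` is a list holding the normal (or erroneous) traces of one trace
    type and `rate` is the corresponding sampling rate. Traces that are not
    returned are the ones that would be discarded.
    """
    rng = rng or random.Random()
    keep = round(len(traces) * rate)   # number of traces to store
    return rng.sample(traces, keep)    # sampling without replacement
```

For example, applying a sampling rate of 0.855 to a group of 145 erroneous traces keeps 124 of them, since 145 × 0.855 = 123.975.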

FIG. 18 shows an example calculation of the number of traces sampled from a set of trace data 1802 already sorted according to trace type. The set of trace data comprises N traces and M different groups of traces where the traces in each group are the same trace type. For example, group of traces 1804 has N_(m) traces with the same trace type TT_(m). For each group of traces, the traces are partitioned into normal traces and erroneous traces based on the corresponding status codes. In FIG. 18, the group of traces 1804 is partitioned according to status codes into normal traces 1806 and erroneous traces 1808. There are N_(n)^((m)) normal traces 1806 and N_(e)^((m)) erroneous traces 1808. Frequencies of occurrence 1810 and 1812 are computed as described above with reference to Equations (1a) and (1b) for the normal and erroneous traces 1806 and 1808, respectively. Sampling rates 1814 and 1816 are computed as described above with reference to Equations (2a) and (2b) for the normal and erroneous traces 1806 and 1808, respectively. The normal trace-type sampling rate 1814 is used to sample and store N̄_(n)^((m)) of the normal traces 1806. The erroneous trace-type sampling rate 1816 is used to sample and store N̄_(e)^((m)) of the erroneous traces 1808.

The trace-type sampling rates represented by Equations (2a) and (2b) ensure that rarely occurring normal and erroneous trace types are sampled at higher sampling rates than more frequently occurring normal and erroneous trace types. Suppose the m-th trace type contains 1,000 traces (i.e., N_(m)=1,000) with 145 erroneous traces (i.e., N_(e)^((m))=145) and 855 normal traces (i.e., N_(n)^((m))=855). The frequency of occurrence of the erroneous traces of trace type TT_(m) is p_(e)=0.145 and the frequency of occurrence of the normal traces of trace type TT_(m) is p_(n)=0.855. The following table shows the normal and erroneous trace-type sampling rates using the same value for the sampling parameter (i.e., β=β_(e)=β_(n)):

Table of Normal and Erroneous Trace-type Sampling Rates

Status Code    Conservative (β = 1)    Aggressive (β = 0.5)    Sup. Agg. (β = 0.25)
normal         0.145                   0.075                   0.038
erroneous      0.855                   0.619                   0.383

The entries in the above table show that as the sampling parameter decreases, the sampling rates also decrease. Note also that the less frequently occurring erroneous traces are sampled with larger sampling rates than the more frequently occurring normal traces across the conservative, aggressive, and super aggressive sampling rates.

In an alternative implementation, the erroneous trace types may be further partitioned based on the types of status codes, such as HTTP error status codes or user-defined error status codes described above. A frequency of occurrence of erroneous traces with error status code u in the m-th trace type is

$\begin{matrix}{p_{e,u}^{(m)} = \frac{N_{e,u}^{(m)}}{N_{m}}} & ( {4a} )\end{matrix}$

where

subscript u denotes a particular error status code;

u=1, . . . , U; and

U is the total number of error status codes.

For example, error status code u may represent one of the HTTP error status codes 4XX and 5XX or a user-defined error status code. The erroneous trace-type sampling rate of the error status code u is given by

$\begin{matrix}{h_{e,u}^{(m)} = 1 - \left( p_{e,u}^{(m)} \right)^{\beta_{e}}} & (4b)\end{matrix}$

The number of erroneous traces of the m-th trace type with error status code u that are sampled and stored in the data storage device is given by

$\begin{matrix}{\overline{N}_{e,u}^{(m)} = N_{e,u}^{(m)} \times h_{e,u}^{(m)}} & (4c)\end{matrix}$

The sampling rate represented by Equation (4b) ensures that rarely occurring erroneous traces are sampled at a higher sampling rate than more frequently occurring erroneous traces.
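The per-status-code partitioning may be sketched as follows. The function name `error_code_sampling_rates` and the representation of the erroneous traces of one trace type as a list of their error status codes are assumptions made for illustration.

```python
from collections import Counter


def error_code_sampling_rates(error_codes, group_size, beta_e=1.0):
    """Return {error_code: sampling rate} for one trace type.

    `error_codes` lists the error status code of each erroneous trace in the
    group and `group_size` is N_m, the total number of traces (normal and
    erroneous) of that trace type.
    """
    counts = Counter(error_codes)  # N_{e,u}^{(m)} for each error status code u
    return {
        code: 1.0 - (count / group_size) ** beta_e
        for code, count in counts.items()
    }
```

In this sketch, a status code that occurs in 45 of 1,000 traces receives a larger sampling rate than a status code that occurs in 100 of the same 1,000 traces.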

FIG. 19 shows an example of the erroneous traces 1808 in FIG. 18 partitioned into U sets of erroneous traces based on U error status codes. For example, erroneous traces 1901 have error status code err₁, erroneous traces 1902 have error status code err_(u), and erroneous traces 1903 have error status code err_(U). Erroneous trace-type sampling rates 1904-1906 are computed for corresponding sets of erroneous traces 1901-1903, as described above with reference to Equations (4a) and (4b). The erroneous trace-type sampling rates 1904-1906 are used to obtain N̄_(e,1) erroneous traces 1908, N̄_(e,u) erroneous traces 1909, and N̄_(e,U) erroneous traces 1910.

For each trace type, the normal and erroneous traces have separate compression ratios and compression rates. A modified Gini index for the fraction of normal traces sampled from the set of trace data across the M different trace types is given by

$G_{n}^{(\beta)} = \frac{\overline{N}_{n}}{N}$

where

$\overline{N}_{n} = \sum\limits_{m = 1}^{M}\overline{N}_{n}^{(m)}$

The compression rate across normal traces with different trace types is given by

$\begin{matrix}{C_{n}^{(\beta)} = 1 - G_{n}^{(\beta)}} & (5a)\end{matrix}$

A modified Gini index for the fraction of erroneous traces sampled from the set of trace data across the M different trace types is given by

$G_{e}^{(\beta)} = \frac{\overline{N}_{e}}{N}$

where

$\overline{N}_{e} = \sum\limits_{m = 1}^{M}\overline{N}_{e}^{(m)}$

The compression rate across erroneous traces with different trace types is given by

$\begin{matrix}{C_{e}^{(\beta)} = 1 - G_{e}^{(\beta)}} & (5b)\end{matrix}$

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by

$G^{(\beta)} = \frac{\overline{N}}{N}$

where $\overline{N} = \overline{N}_{e} + \overline{N}_{n}$.

The compression rate is given by

$\begin{matrix}{C^{(\beta)} = 1 - G^{(\beta)}} & (5c)\end{matrix}$

Diversity of frequencies of occurrence may be measured by the modified Gini index. For example, trace-type sampling may be selected when the modified Gini index satisfies the following condition:

$\begin{matrix}{G^{(\beta)} \leq Th_{G}} & (5d)\end{matrix}$

where

G^((β)) represents G_(e)^((β)) or G_(n)^((β));

β_(e)=β_(n)=β; and

Th_(G) is a modified Gini index threshold (e.g., Th_(G)=0.1, 0.05, or 0.01).

When the condition given in Equation (5d) is not satisfied, trace-type information is not adequate for investigating performance of an application.
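The modified Gini indices, the compression rates, and the selection condition of Equation (5d) may be sketched as follows. The function names and the dictionary keys are hypothetical and chosen only for this illustration.

```python
def compression_rates(sampled_normal, sampled_erroneous, total_traces):
    """Compute modified Gini indices and compression rates.

    `sampled_normal` and `sampled_erroneous` are the numbers of sampled normal
    and erroneous traces summed over all trace types; `total_traces` is N.
    """
    g_n = sampled_normal / total_traces
    g_e = sampled_erroneous / total_traces
    g = (sampled_normal + sampled_erroneous) / total_traces
    return {
        "G_n": g_n, "C_n": 1.0 - g_n,   # Equation (5a)
        "G_e": g_e, "C_e": 1.0 - g_e,   # Equation (5b)
        "G": g, "C": 1.0 - g,           # Equation (5c)
    }


def trace_type_sampling_selected(gini_index, threshold=0.1):
    """Condition (5d): select trace-type sampling when the index is at most Th_G."""
    return gini_index <= threshold
```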

Duration Sampling of Normal and Erroneous Traces

Computer-implemented methods perform duration sampling on trace durations independent of the trace type. Erroneous traces usually have short durations or long durations. Traces of the trace data are sorted based on duration. For example, the traces may be sorted from shortest (longest) duration to longest (shortest) duration. The duration-sampling rates described below are used to separately sample normal and erroneous traces in corresponding bins of the histogram, where each bin corresponds to a time interval.

FIG. 20 shows an example of the set of trace data 1702 sorted according to trace durations to obtain duration-sorted traces 2002. For the sake of illustration, trace types, span information, and status information are omitted. The durations are denoted by D_(n), where n=1, . . . , N. In this example, the traces are sorted from longest duration to shortest duration with D₁ representing the longest trace duration and D_(N) representing the shortest duration.

Computer-implemented methods compute upper and lower thresholds for distinguishing normal traces from erroneous traces of the duration-sorted traces. Traces with durations between the upper and lower thresholds are identified as normal traces. Traces with durations that are greater than the upper threshold or less than the lower threshold are identified as erroneous traces. A histogram is constructed for the traces with normal traces having durations that fall between the lower and upper thresholds and erroneous traces having durations that are less than the lower threshold or greater than the upper threshold.

Upper and lower quantiles are used to partition the duration-sorted traces into three groups of traces. The upper and lower quantiles are given by

M(upper)=q_(1−s)

M(lower)=q_(s)

where 0≤s≤1 (e.g., s=0.05 or s=0.1).

The lower quantile q_(s) is a time that partitions the duration-sorted traces such that a fraction s of the traces have durations that are less than or equal to the quantile q_(s). The upper quantile q_(1−s) is a time that partitions the duration-sorted traces such that a fraction s of the traces have durations that are greater than or equal to the quantile q_(1−s). For example, if s=0.1, the lower quantile q_(0.1) denotes a time that partitions the duration-sorted traces such that 10% of the traces have durations that are less than or equal to q_(0.1) and the upper quantile q_(0.9) denotes a time that partitions the duration-sorted traces such that 10% of the traces have durations that are greater than or equal to q_(0.9). Upper distances are computed for traces with durations that are greater than or equal to the upper quantile by

dist(upper)=|data(upper)−M(upper)|  (6a)

and lower distances are computed for traces with durations that are less than or equal to the lower quantile by

dist(lower)=|data(lower)−M(lower)|  (6b)

where

data(upper) represents a trace duration that is greater than or equal to M(upper); and

data(lower) represents a trace duration that is less than or equal to M(lower).

A mean average deviation (“MAD”) is computed for the set of upper distances and is denoted by MAD(upper). A MAD is computed for the set of lower distances and is denoted by MAD(lower). Upper and lower thresholds for the duration-sorted traces are computed as follows:

Th _(upper)=min(M(upper)+Γ×MAD(upper), max (duration))   (7a)

and

Th _(lower)=max(M(lower)−Γ×MAD (lower),min(duration))   (7b)

where

0<Γ<1 (e.g., Γ=0.25, 0.20, or 0.30);

max(duration) is the maximum trace duration; and

min(duration) is the minimum trace duration.

A trace duration D_(n) is identified as an outlier if the trace duration satisfies one of the following conditions:

D_(n)>Th_(upper)   (8a)

D_(n)<Th_(lower)   (8b)

A histogram is constructed from traces with durations that satisfy thefollowing condition:

Th_(upper)≥D_(n)≥Th_(lower)   (8c)

FIG. 21 shows an example of partitioning duration-sorted traces 2102. Directional arrow 2104 represents increasing durations of the traces, with trace 2106 having the maximum duration max(duration) and trace 2108 having the minimum duration min(duration). Mark 2110 represents an upper quantile q_(1−s). Mark 2112 represents a lower quantile q_(s). The quantiles q_(1−s) and q_(s) partition the duration-sorted traces 2102 into three groups of traces. The first group comprises the fraction s of the traces with durations greater than or equal to q_(1−s). The second group comprises the fraction s of the traces with durations less than or equal to q_(s). The third group comprises the fraction 1−2s of the traces with durations between q_(s) and q_(1−s). Distances are calculated for traces 2114 according to Equation (6b). For example, directional arrow 2116 represents a distance between trace duration 2118 and the lower quantile q_(s) 2112. Distances are calculated for traces 2120 according to Equation (6a). For example, directional arrow 2122 represents a distance between trace duration 2124 and the upper quantile q_(1−s) 2110. MAD(lower) is computed for the distances associated with traces 2114. MAD(upper) is computed for the distances associated with traces 2120. Lower threshold 2126 is computed using Equation (7b). Upper threshold 2128 is computed using Equation (7a). In this example, traces 2130 with durations that are less than the lower threshold 2126 are identified as outliers and traces 2132 with durations that are greater than the upper threshold 2128 are identified as outliers.
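The threshold computation of Equations (6a) through (7b) may be sketched as follows. Two choices in this sketch are assumptions and are not specified above: the quantiles are taken with a simple nearest-rank rule on the sorted durations, and the MAD is interpreted as the mean absolute deviation of the distances about their mean. The function names are hypothetical.

```python
from statistics import fmean


def mean_abs_deviation(values):
    """Mean absolute deviation about the mean (assumed interpretation of MAD)."""
    center = fmean(values)
    return fmean([abs(v - center) for v in values])


def duration_thresholds(durations, s=0.1, gamma=0.25):
    """Return (Th_lower, Th_upper) for a non-empty list of trace durations."""
    ordered = sorted(durations)
    n = len(ordered)
    # Nearest-rank quantiles (an assumption of this sketch).
    q_lower = ordered[int(s * (n - 1))]         # M(lower) = q_s
    q_upper = ordered[int((1 - s) * (n - 1))]   # M(upper) = q_(1-s)
    # Distances of the tail durations from the quantiles, Equations (6a), (6b).
    upper_dist = [abs(d - q_upper) for d in ordered if d >= q_upper]
    lower_dist = [abs(d - q_lower) for d in ordered if d <= q_lower]
    # Thresholds clipped to the observed range, Equations (7a) and (7b).
    th_upper = min(q_upper + gamma * mean_abs_deviation(upper_dist), ordered[-1])
    th_lower = max(q_lower - gamma * mean_abs_deviation(lower_dist), ordered[0])
    return th_lower, th_upper
```

Durations greater than the returned upper threshold or less than the returned lower threshold would then be treated as outliers in the sense of Equations (8a) and (8b).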

Traces with durations that satisfy either of the conditions given by Equations (8a) and (8b) are erroneous traces that lie within the upper interval (Th_(upper), max(duration)] and the lower interval [min(duration), Th_(lower)), respectively. The range of time between the upper and lower thresholds is partitioned into B equal duration intervals denoted by [c_(b−1), c_(b)) for b=1, . . . , B−1, and [c_(B−1), c_(B)], where c₀=Th_(lower) and c_(B)=Th_(upper). Each bin of the histogram corresponds to a time interval. A trace with a duration that satisfies the condition given by Equation (8c) is identified as a normal trace. A normal trace that lies within one of the intervals is assigned to the bin that corresponds to the interval. The number of traces in each bin is counted and denoted by n_(b), where b=1, . . . , B. For example, n_(b) represents the total number of traces in the interval [c_(b−1), c_(b)) and n_(B) represents the total number of traces in the interval [c_(B−1), c_(B)]. The number of erroneous traces that lie within the lower interval [min(duration), Th_(lower)) is denoted by n_(S), and these traces form a short-duration bin of erroneous traces. The number of traces that lie within the upper interval (Th_(upper), max(duration)] is denoted by n_(L), and these traces form a long-duration bin of erroneous traces. A histogram of traces is constructed by counting the number of traces in each bin.

FIG. 22 shows an example histogram constructed from a set of trace data. Horizontal axis 2202 represents time. Vertical axis 2204 represents number of traces. Time axis 2202 is partitioned into intervals between a lower threshold 2206 and an upper threshold 2208. In this example, the time axis 2202 includes a minimum duration 2210 and a maximum duration 2212. Unshaded bars represent the number of normal traces with durations that lie within the intervals (i.e., number of traces that lie within corresponding bins). For example, FIG. 22 shows a magnified view of intervals 2214 and 2216. Bar 2218 represents the number of traces, n_(b), in the interval 2214. Bar 2220 represents the number of traces, n_(b+1), in the interval 2216. Shaded bar 2222 represents the number of erroneous traces in the lower interval [min(duration), Th_(lower)). Shaded bar 2224 represents the number of erroneous traces in the upper interval (Th_(upper), max(duration)].

A histogram may also be constructed for the trace durations using the t-digest approach described in “Computing extremely accurate quantiles using t-digests,” T. Dunning et al., arXiv.org, Cornell University, Feb. 11, 2019. Instead of storing the entire set of trace data based on trace durations, t-digest stores only the results of data clustering, such as centroids of clusters and trace counts in each cluster.

A histogram of traces in the B bins is given by

Hist(B)={n_(S), n₁, . . . , n_(B), n_(L)}

where

n_(b) is the number of traces in the b-th bin with durations in the interval [c_(b−1), c_(b)) for b=1, . . . , B−1;

n_(B) is the number of traces in the B-th bin with durations in the interval [c_(B−1), c_(B)];

n_(S) is the number of short duration traces (i.e., erroneous traces) in the interval [min(duration), Th_(lower)); and

n_(L) is the number of long duration traces (i.e., erroneous traces) in the interval (Th_(upper), max(duration)].
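The construction of the histogram counts may be sketched as follows. The function name `build_histogram` and the returned tuple layout are assumptions made for illustration, and the sketch assumes that the upper threshold is strictly greater than the lower threshold.

```python
def build_histogram(durations, th_lower, th_upper, num_bins):
    """Return (n_S, [n_1, ..., n_B], n_L) for the given thresholds.

    Durations below th_lower are counted in the short-duration bin, durations
    above th_upper in the long-duration bin, and the rest in B equal-width
    bins, where the last bin is closed on the right.
    """
    width = (th_upper - th_lower) / num_bins
    n_short = 0
    n_long = 0
    bins = [0] * num_bins
    for d in durations:
        if d < th_lower:
            n_short += 1
        elif d > th_upper:
            n_long += 1
        else:
            # A duration equal to th_upper falls into the last bin.
            index = min(int((d - th_lower) / width), num_bins - 1)
            bins[index] += 1
    return n_short, bins, n_long
```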

The frequency of occurrence of traces in the b-th bin of the histogram is given by

$\begin{matrix}{p_{b} = \frac{n_{b}}{N_{H}}} & (9)\end{matrix}$

where

$N_{H} = \sum\limits_{b = 1}^{B}n_{b} + n_{S} + n_{L}$

The normal duration sampling rate for normal traces in the b-th bin is given by

$\begin{matrix}{r_{b} = 1 - \left( p_{b} \right)^{\alpha_{n}}} & (10)\end{matrix}$

where 0≤α_(n) is called the “normal duration sampling parameter.”

The normal duration-sampling rates in Equation (10) is the fractions oftraces to be sampled from the b-th bin and stored in a data storagedevice. The frequency of occurrence of traces in the S-th bin of thehistogram is given by:

$\begin{matrix}{p_{S} = \frac{n_{S}}{N_{H}}} & ( {11a} )\end{matrix}$

The frequency of occurrence of traces in the L-th bin of the histogram is given by:

$\begin{matrix}{p_{L} = \frac{n_{L}}{N_{H}}} & ( {11b} )\end{matrix}$

The short duration sampling rate for traces in the S-th bin is given by

h_(S)=1−(p_(S))^(α_(e))   (12a)

and the long duration sampling rate for traces in the L-th bin is given by

h_(L)=1−(p_(L))^(α_(e))   (12b)

where 0≤α_(e) is called the “erroneous duration sampling parameter.”

The erroneous duration-sampling rates in Equations (12a) and (12b) are the fractions of traces to be sampled from the corresponding S-th and L-th bins and stored in a data storage device.

Note that in one implementation α_(n)≠α_(e) and in another implementation α_(n)=α_(e). The duration-sampling parameter α corresponds to an amount of trace sampling based on the user-selected sampling level described above. For example, “conservative” sampling corresponds to α=1, “aggressive” sampling corresponds to α=0.5, and “super aggressive” sampling corresponds to α=0.25. The duration-sampling parameter α may be selected to provide the user-selected sampling level as described below.

The normal and erroneous duration-sampling rates in Equations (10), (12a), and (12b) may be different for each bin and are inversely related to the frequency of occurrence of the traces in each bin. For example, suppose a histogram comprises 10,000 traces with 460 traces in a bin B₁ (i.e., n₁=460) and 2,035 traces in a bin B₂ (i.e., n₂=2,035). The frequency of occurrence of traces in B₁ is p₁=0.046 and the frequency of occurrence of traces in B₂ is p₂=0.204. The following table shows the duration-sampling rates for the example traces in B₁ and B₂:

Table of Duration-sampling Rates

Bins    Conservative (α = 1)    Aggressive (α = 0.5)    Super aggressive (α = 0.25)
B₁      0.954                   0.786                   0.537
B₂      0.796                   0.548                   0.328

Note that the less frequently occurring traces in the bin B₁ are sampled with a larger duration-sampling rate than the more frequently occurring traces in the bin B₂ across the conservative, aggressive, and super aggressive sampling rates.
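The table values follow directly from the form of Equation (10). The following Python sketch is illustrative only: the function name `duration_sampling_rate` and the dictionary layout are assumptions made for this example and are not part of the described system.

```python
def duration_sampling_rate(p: float, alpha: float) -> float:
    """Return 1 - p**alpha, the form used in Equations (10), (12a), and (12b)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("frequency of occurrence must lie in [0, 1]")
    if alpha < 0.0:
        raise ValueError("sampling parameter must be non-negative")
    return 1.0 - p ** alpha


# Frequencies of occurrence taken from the example (p1 = 0.046, p2 = 0.204).
frequencies = {"B1": 0.046, "B2": 0.204}
# Sampling levels and the duration-sampling parameter each corresponds to.
levels = {"conservative": 1.0, "aggressive": 0.5, "super aggressive": 0.25}

# Sampling rate for every (bin, level) pair, rounded to three decimals.
rates = {
    bin_name: {
        level: round(duration_sampling_rate(p, alpha), 3)
        for level, alpha in levels.items()
    }
    for bin_name, p in frequencies.items()
}
```

Because p lies between 0 and 1, a smaller α pushes p^α toward 1 and therefore lowers the sampling rate for every bin, while the rarer bin always keeps the larger rate.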

The number of normal traces sampled from the b-th bin and stored in the data storage device is given by:

n̄_(b)=n_(b)×r_(b)   (13)

where n̄_(b) is rounded to the nearest integer number.

The number of erroneous traces sampled from the S-th and L-th bins and stored in the data storage device is given by:

n̄_(S)=n_(S)×h_(S)   (14a)

n̄_(L)=n_(L)×h_(L)   (14b)

where n̄_(S) and n̄_(L) are rounded to the nearest integer number.

The remaining unsampled traces are discarded by deleting the unsampled traces from a data storage device.
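The binning and the counts in Equations (13), (14a), and (14b) can be sketched as follows. This is a minimal illustration under stated assumptions: equal-width bins between the two thresholds, hypothetical function names `duration_histogram` and `sampled_counts`, and Python's built-in `round`, which resolves exact ties to the nearest even integer.

```python
def duration_histogram(durations, th_lower, th_upper, num_bins):
    """Count traces in the short interval, the B normal bins, and the long interval."""
    width = (th_upper - th_lower) / num_bins  # assumption: equal-width bins
    n_short, n_long = 0, 0
    bins = [0] * num_bins
    for d in durations:
        if d < th_lower:
            n_short += 1  # [min(duration), Th_lower)
        elif d > th_upper:
            n_long += 1  # (Th_upper, max(duration)]
        else:
            # The last bin is closed on the right, so d == th_upper lands in it.
            b = min(int((d - th_lower) / width), num_bins - 1)
            bins[b] += 1
    return n_short, bins, n_long


def sampled_counts(n_short, bins, n_long, alpha_n, alpha_e):
    """Apply Equations (9)-(14b) to obtain the number of traces kept per bin."""
    total = n_short + sum(bins) + n_long  # N_H in Equation (9)

    def kept(n, alpha):
        if n == 0:
            return 0
        rate = 1.0 - (n / total) ** alpha
        return round(n * rate)

    return (
        kept(n_short, alpha_e),
        [kept(n, alpha_n) for n in bins],
        kept(n_long, alpha_e),
    )


# Hypothetical durations (arbitrary time units) used only for illustration.
example = [1, 2, 10, 11, 12, 13, 20, 21, 29, 30, 50]
hist = duration_histogram(example, th_lower=10, th_upper=30, num_bins=2)
kept_counts = sampled_counts(*hist, alpha_n=1.0, alpha_e=1.0)
```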

Returning to FIG. 22, a frequency of occurrence p_(b) 2226 is computed for traces with durations in the interval [c_(b−1), c_(b)) 2214. A duration-sampling rate r_(b) 2228 is computed for traces in the interval [c_(b−1), c_(b)) 2214 (i.e., corresponding b-th bin). The number of traces sampled from the b-th bin is n̄_(b) 2230. The set of sampled normal traces 2232 is obtained from sampling each bin of the traces. A frequency of occurrence p_(S) 2234 and erroneous duration sampling rate 2236 are computed for traces with durations in the short duration interval [min(duration), Th_(lower)). The number of traces sampled from the short duration interval is n̄_(S) 2238. A frequency of occurrence p_(L) 2240 and erroneous duration sampling rate 2242 are computed for traces with durations in the long duration interval (Th_(upper), max(duration)]. The number of traces sampled from the long duration interval is n̄_(L) 2244.

The modified Gini index equals the fraction of traces sampled from the bins. For the normal traces, the modified Gini index is given by

$G_{n}^{(\alpha)} = \frac{{\overset{\_}{N}}_{H}}{N_{H}}$

where

${\overset{\_}{N}}_{H} = {\sum\limits_{b = 1}^{B}{\overset{\_}{n}}_{b}}$

$N_{H} = {\sum\limits_{b = 1}^{B}n_{b}}$

The compression rate across the traces with normal durations is given by

C _(n) ^((α))=1−G _(n) ^((α))   (15a)

For the erroneous traces, the modified Gini index is given by

$G_{e}^{(\alpha)} = \frac{{\overset{\_}{n}}_{L} + {\overset{\_}{n}}_{S}}{n_{L} + n_{S}}$

The compression rate across the traces with erroneous durations is given by

C _(e) ^((α))=1−G _(e) ^((α))   (15b)

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by:

$G^{(\alpha)} = \frac{{\overset{\_}{N}}_{H} + {\overset{\_}{n}}_{L} + {\overset{\_}{n}}_{S}}{N}$

The compression rate across normal and erroneous traces is given by

C ^((α))=1−G ^((α))   (15c)

where α_(e)=α_(n)=α.
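The modified Gini indices and compression rates of Equations (15a)-(15c) reduce to simple ratios of counts. The sketch below is illustrative; the helper names and the example counts are assumptions, not values from the disclosure.

```python
def modified_gini(sampled: int, total: int) -> float:
    """Fraction of traces that were sampled (kept)."""
    if total <= 0:
        raise ValueError("total number of traces must be positive")
    return sampled / total


def compression_rate(sampled: int, total: int) -> float:
    """One minus the modified Gini index, i.e. the fraction discarded."""
    return 1.0 - modified_gini(sampled, total)


# Hypothetical counts: two normal bins plus the short and long erroneous intervals.
normal_bins, normal_kept = [4, 4], [3, 3]
n_short, n_long = 2, 1
kept_short, kept_long = 2, 1

c_normal = compression_rate(sum(normal_kept), sum(normal_bins))
c_erroneous = compression_rate(kept_short + kept_long, n_short + n_long)
c_overall = compression_rate(
    sum(normal_kept) + kept_short + kept_long,
    sum(normal_bins) + n_short + n_long,
)
```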

Hybrid Sampling of Known Trace Types and Durations with Normal and Erroneous Traces

When both trace types and trace durations are important for troubleshooting performance of an application, a hybrid combination of trace-type sampling and duration-based sampling may be applied across different trace types and different trace durations for normal and erroneous traces.

A set of trace data is sorted into different trace types as described above with reference to FIG. 17. The traces of each trace type are partitioned into normal traces and erroneous traces as described above with reference to FIG. 18. Normal and erroneous trace-type sampling rates are computed for each of M trace types. For the normal traces with the m-th trace type, a frequency of occurrence of normal traces p_(n) ^((m)) is computed as described above with reference to Equation (1a) and FIG. 18. For the erroneous traces, a frequency of occurrence of erroneous traces p_(e) ^((m)) is computed as described above with reference to Equation (1b) and FIG. 18. Separate histograms are constructed for the normal traces and for the erroneous traces as described above with reference to FIGS. 20 and 21. A frequency of occurrence is computed for normal traces in each bin of a histogram of the normal traces of the m-th trace type as follows:

$\begin{matrix}{p_{n,b}^{(m)} = \frac{n_{n,b}^{(m)}}{N_{n,H}^{(m)}}} & ( {16a} )\end{matrix}$

where

-   subscript n denotes normal traces;
-   b=1, . . . , B;
-   n_(n,b) ^((m)) is the number of normal traces with the m-th trace type in the b-th bin; and

$N_{n,H}^{(m)} = {\sum\limits_{b = 1}^{B}n_{n,b}^{(m)}}$

A frequency of occurrence is computed for erroneous traces in each bin of a histogram of erroneous traces of the m-th trace type as follows:

$\begin{matrix}{p_{e,b}^{(m)} = \frac{n_{e,b}^{(m)}}{N_{e,H}^{(m)}}} & ( {16b} )\end{matrix}$

where

-   subscript e denotes erroneous traces;
-   n_(e,b) ^((m)) is the number of erroneous traces with the m-th trace type in the b-th bin; and

$N_{e,H}^{(m)} = {\sum\limits_{b = 1}^{B}n_{e,b}^{(m)}}$

FIG. 23 shows an example set of erroneous traces 2302 and an example set of normal traces 2304 for the same m-th trace type. Traces in the sets 2302 and 2304 have the same trace type TT_(m). The traces in each set of normal and erroneous traces are sorted according to trace duration as described above with reference to FIG. 20. Trace durations of erroneous traces are denoted by D_(N_(e) ^((m))), where N_(e) ^((m)) 2306 is the number of erroneous traces of the m-th trace type. A frequency of occurrence of the traces p_(e) ^((m)) 2308 is determined for the set 2302. FIG. 23 shows an example erroneous trace histogram 2310 constructed from the set of erroneous traces 2302. The range of time is partitioned into B equal duration intervals denoted by [c_(b−1), c_(b)), for b=1, . . . , B−1, and [c_(B−1), c_(B)] as described above with reference to FIG. 22. The number of erroneous traces in each interval (i.e., bin) is counted and denoted n_(e,b) ^((m)). A set of frequencies of occurrence 2312 is computed for each bin, as described above with reference to Equation (16b), to obtain frequencies of occurrence of erroneous traces of the m-th trace type. Trace durations of normal traces are denoted by D_(N_(n) ^((m))), where N_(n) ^((m)) 2314 is the number of normal traces of the m-th trace type. A frequency of occurrence of the traces p_(n) ^((m)) 2316 is determined for the set 2304. A set of frequencies of occurrence 2318 is computed for each bin of a normal trace histogram constructed for the set of normal traces 2304, as described above with reference to Equation (16a), to obtain frequencies of occurrence of normal traces of the m-th trace type.

A normal hybrid sampling rate for each bin of the m-th set of normal traces is given by

h _(n,b) ^((m))=1−(p _(n) ^((m)))^(β) ^(n) (p _(n,b) ^((m)))^(α) ^(n)  (17a)

where 0≤α_(n) and 0≤β_(n) are normal trace sampling parameters.

The normal hybrid-sampling rate in Equation (17a) may be different for each bin of each group of traces and is inversely related to the frequency of occurrence of the traces in each bin and each group of traces.

An erroneous hybrid sampling rate for each bin of the m-th set of erroneous traces is given by

h _(e,b) ^((m))=1−(p _(e) ^((m)))^(β) ^(e) (p _(e,b) ^((m)))^(α) ^(e)  (17b)

where 0≤α_(e) and 0≤β_(e) are erroneous trace sampling parameters.

The erroneous hybrid-sampling rate in Equation (17b) may be different for each bin of each group of traces and is inversely related to the frequency of occurrence of the traces in each bin and each group of traces.

There are many ways in which the sampling parameters in Equations (17a) and (17b) may be selected for sampling. In one implementation, α_(e)=β_(e) and α_(n)=β_(n), but α_(e)≠α_(n). In another implementation, α_(e)=α_(n) and β_(e)=β_(n), but α_(e)≠β_(e). In another implementation, the sampling parameters are the same with α_(e)=β_(e)=α_(n)=β_(n). In still another implementation, the sampling parameters are different with α_(e)≠β_(e)≠α_(n)≠β_(n).

The number of normal traces sampled from the b-th bin and stored in the data storage device is given by:

n̄_(n,b) ^((m))=n_(n,b) ^((m))×h_(n,b) ^((m))   (18a)

and the number of erroneous traces sampled from the b-th bin and stored in the data storage device is given by:

n̄_(e,b) ^((m))=n_(e,b) ^((m))×h_(e,b) ^((m))   (18b)

where n̄_(n,b) ^((m)) and n̄_(e,b) ^((m)) are rounded to the nearest integer number.

Remaining unsampled traces are discarded by deleting the unsampled traces from a data storage device.
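The hybrid rate of Equations (17a) and (17b) multiplies a trace-type term by a duration term before subtracting from one. The following Python sketch is illustrative; `hybrid_sampling_rate` is a hypothetical name, and the trace-type frequency is passed in as an argument because its computation (Equations (1a) and (1b)) is described elsewhere.

```python
def hybrid_sampling_rate(p_type: float, p_bin: float, beta: float, alpha: float) -> float:
    """Return 1 - (p_type**beta) * (p_bin**alpha), the form of Equations (17a)/(17b)."""
    for value in (p_type, p_bin):
        if not 0.0 <= value <= 1.0:
            raise ValueError("frequencies of occurrence must lie in [0, 1]")
    if beta < 0.0 or alpha < 0.0:
        raise ValueError("sampling parameters must be non-negative")
    return 1.0 - (p_type ** beta) * (p_bin ** alpha)


# Hypothetical bin: 40 traces, trace-type frequency 0.25, bin frequency 0.25.
rate = hybrid_sampling_rate(p_type=0.25, p_bin=0.25, beta=0.5, alpha=0.5)
kept = round(40 * rate)  # Equations (18a)/(18b)

# Setting beta to 0 removes the trace-type term and leaves the duration-only form.
duration_only = hybrid_sampling_rate(p_type=0.3, p_bin=0.25, beta=0.0, alpha=0.5)
```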

The modified Gini index of the normal traces equals the fraction of normal traces sampled from the bins:

$G_{n}^{({\beta,\alpha})} = \frac{{\overset{\_}{N}}_{n}}{N_{n}}$

where

${\overset{\_}{N}}_{n} = {\sum\limits_{m = 1}^{M}{\sum\limits_{b = 1}^{B}{\overset{\_}{n}}_{n,b}^{(m)}}}$

$N_{n} = {\sum\limits_{m = 1}^{M}{\sum\limits_{b = 1}^{B}n_{n,b}^{(m)}}}$

The compression rate for normal hybrid sampling of the traces is given by

C _(n) ^((β,α))=1−G _(n) ^((β,α))   (19a)

The modified Gini index of the erroneous traces equals the fraction of erroneous traces sampled from the bins:

$G_{e}^{({\beta,\alpha})} = \frac{{\overset{\_}{N}}_{e}}{N_{e}}$

where

${\overset{\_}{N}}_{e} = {\sum\limits_{m = 1}^{M}{\sum\limits_{b = 1}^{B}{\overset{\_}{n}}_{e,b}^{(m)}}}$

$N_{e} = {\sum\limits_{m = 1}^{M}{\sum\limits_{b = 1}^{B}n_{e,b}^{(m)}}}$

The compression rate for erroneous hybrid sampling of the traces is given by

C _(e) ^((β,α))=1−G _(e) ^((β,α))   (19b)

A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by:

$G^{({\beta,\alpha})} = \frac{\overset{\_}{N}}{N}$

where

-   β_(e)=β_(n)=β;
-   α_(e)=α_(n)=α; and
-   N̄=N̄_(e)+N̄_(n).

The compression rate across normal and erroneous traces with different trace types is given by

C ^((β,α))=1−G ^((β,α))   (19c)

Sampling Based Only on Normal and Erroneous Traces

Trace sampling may be performed on a set of trace data regardless of trace type and/or trace duration. A set of trace data is partitioned into normal traces and erroneous traces. Let N be the total number of traces in a set of trace data. Let N_(n) be the total number of normal traces in the set of trace data. Let N_(e) be the total number of erroneous traces in the set of trace data.

FIG. 24 shows an example of the set of trace data 1702 partitioned into normal traces 2402 and erroneous traces 2404 based only on the erroneous and normal status of the traces in the set 1702. For example, column 1711 lists the normal and erroneous status codes of the traces in the set 1702, such as entries 2406 and 2408. The set of trace data 1702 contains N total traces. The normal trace data 2402 contains N_(n) normal traces. The erroneous trace data 2404 contains N_(e) erroneous traces.

A frequency of occurrence of normal traces is given by

$\begin{matrix}{p_{n} = \frac{N_{n}}{N}} & ( {20a} )\end{matrix}$

and a frequency of occurrence of erroneous traces is given by

$\begin{matrix}{p_{e} = \frac{N_{e}}{N}} & ( {20b} )\end{matrix}$

where N=N_(n)+N_(e).

The normal sampling rate for sampling the normal traces is given by

h _(n)=1−(p _(n))^(β) ^(n)   (21a)

The erroneous sampling rate for sampling the erroneous traces is given by

h _(e)=1−(p _(e))^(β) ^(e)   (21b)

The sampling rates represented by Equations (21a) and (21b) ensure that rarely occurring normal and erroneous traces are sampled at a higher sampling rate than are more frequently occurring normal and erroneous traces.

The number of normal traces stored in the data storage device is given by:

N̄_(n)=N_(n)×h_(n)   (22a)

The number of erroneous traces stored in the data storage device is given by

N̄_(e)=N_(e)×h_(e)   (22b)

The numbers of traces N̄_(n) and N̄_(e) are rounded to the nearest integer number.

The sampling parameters β_(n) and β_(e) correspond to the amount of normal and erroneous traces sampled and are based on user-selected sampling rates described below. Note that in one implementation β_(n)≠β_(e) and in another implementation β_(n)=β_(e). For example, “conservative” sampling corresponds to β=1, “aggressive” sampling corresponds to β=0.5, and “super aggressive” sampling corresponds to β=0.25, where β represents β_(n) and β_(e). The sampling parameters β_(n) and β_(e) are determined based on the user-selected sampling rate as described below.
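Equations (20a)-(22b) can be exercised with a short Python sketch. The function name `status_sampling` and the example counts (900 normal and 100 erroneous traces) are assumptions chosen for illustration.

```python
def status_sampling(n_normal: int, n_erroneous: int, beta_n: float, beta_e: float):
    """Return (normal rate, erroneous rate, normal kept, erroneous kept)."""
    total = n_normal + n_erroneous
    if total == 0:
        raise ValueError("the set of trace data is empty")
    p_n = n_normal / total  # Equation (20a)
    p_e = n_erroneous / total  # Equation (20b)
    h_n = 1.0 - p_n ** beta_n  # Equation (21a)
    h_e = 1.0 - p_e ** beta_e  # Equation (21b)
    # Equations (22a) and (22b), rounded to the nearest integer.
    return h_n, h_e, round(n_normal * h_n), round(n_erroneous * h_e)


# "Aggressive" sampling (beta = 0.5) on a hypothetical set of 1,000 traces.
h_n, h_e, kept_normal, kept_erroneous = status_sampling(900, 100, 0.5, 0.5)
```

In this example the rare erroneous traces are kept at a much higher rate than the common normal traces, which is the behavior the sampling rates are designed to produce.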

The modified Gini index for the normal sampling rate is given by

$G_{n}^{(\beta)} = \frac{{\overset{\_}{N}}_{n}}{N_{n}}$

The compression rate across normal traces is given by

C _(n) ^((β))=1−G _(n) ^((β))   (23a)

The modified Gini index for the erroneous sampling rate is given by

$G_{e}^{(\beta)} = \frac{{\overset{\_}{N}}_{e}}{N_{e}}$

The compression rate across erroneous traces is given by

C _(e) ^((β))=1−G_(e) ^((β))   (23b)

The sampling parameters β_(n) and β_(e) may be independently selected based on desired sampling rates h_(n) and h_(e) or compression rates. A modified Gini index for the fraction of normal and erroneous traces sampled from the set of trace data across the M different trace types is given by:

$G^{(\beta)} = \frac{\overset{\_}{N}}{N}$

where N̄=N̄_(e)+N̄_(n).

The compression rate is given by

C ^((β))=1−G^((β))   (23c)

In another implementation, the set of erroneous traces may be partitioned further based on error codes, such as HTTP error status codes 4XX and 5XX or user-defined error status codes. FIG. 25 shows an example of the set of erroneous traces 2404 partitioned into U sets of erroneous traces, each set of erroneous traces corresponding to a different error code. In the example of FIG. 25, three sets 2501-2503 of the U sets of erroneous traces are represented. The erroneous traces in each set have the same error code. For example, set 2501 has error code err₁, set 2502 has error code err_(u), and set 2503 has error code err_(U).

A frequency of occurrence of erroneous traces is given by

$\begin{matrix}{p_{e,u} = \frac{N_{e,u}}{N_{e}}} & (24)\end{matrix}$

where u=1, . . . , U.

The erroneous sampling rate for sampling the set of erroneous traces with error code u is given by

h_(e,u)=1−(p_(e,u))^(β_(e))   (25)

The modified Gini index for the erroneous sampling rate with error status code u is given by

$G_{e,u}^{(\beta)} = \frac{{\overset{\_}{N}}_{e,u}}{N_{e}}$

The compression rate across all error status codes is given by

$\begin{matrix}{C_{e}^{(\beta)} = {{1 - {G_{e}^{(\beta)}{where}G_{e}^{(\beta)}}} = {\sum\limits_{u = 1}^{U}G_{e,u}^{(\beta)}}}} & (26)\end{matrix}$
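Per-error-code sampling in Equations (24)-(26) can be sketched as follows. The helper name, the use of `collections.Counter`, and the example mix of HTTP status codes are assumptions made for illustration.

```python
from collections import Counter


def sample_counts_by_error_code(error_codes, beta_e):
    """Return ({code: number kept}, N_e) for an iterable of error codes."""
    counts = Counter(error_codes)
    n_e = sum(counts.values())
    kept = {}
    for code, n_eu in counts.items():
        p_eu = n_eu / n_e  # Equation (24)
        h_eu = 1.0 - p_eu ** beta_e  # Equation (25)
        kept[code] = round(n_eu * h_eu)
    return kept, n_e


# Hypothetical erroneous traces: one error code per trace.
codes = ["404"] * 80 + ["500"] * 15 + ["503"] * 5
kept, n_e = sample_counts_by_error_code(codes, beta_e=0.5)

# Equation (26): sum the per-code modified Gini indices, then subtract from one.
gini_e = sum(kept.values()) / n_e
compression_e = 1.0 - gini_e
```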

Sampling based on a User Selected Overall Sampling Rate and an Erroneous Trace Sampling Rate

In this implementation, a user selects an overall sampling rate, h, of a set of trace data and selects an erroneous trace sampling rate h_(e). Computer-implemented methods and systems described below determine a normal trace sampling rate h_(n). The number of traces that are sampled and stored for an overall sampling rate h is given by

N̄=h×N

The remaining unsampled traces, N−N̄, are discarded. The number of sampled traces in terms of number of sampled normal and erroneous traces is given by

N̄=N̄_(e)+N̄_(n)=h×N   (27)

Let h_(n) and h_(e) be the normal and erroneous trace sampling rates. The relationships between the number of normal and erroneous traces to be sampled and the normal and erroneous trace sampling rates are given by

N̄_(n)=h_(n)×N_(n)   (28a)

N̄_(e)=h_(e)×N_(e)   (28b)

Substituting Equations (28a) and (28b) into Equation (27) and dividing by N gives

h _(n) ×p _(n) +h _(e) ×p _(e) =h   (29)

where p_(n)=N_(n)/N and p_(e)=N_(e)/N.

The frequencies of occurrence p_(n) and p_(e) are determined as described above with reference to Equations (20a) and (20b). When a user selects the erroneous trace sampling rate, h_(e), the normal trace sampling rate is given by

$\begin{matrix}{h_{n} = \frac{h - {h_{e} \times p_{e}}}{p_{n}}} & (30)\end{matrix}$

When h_(n)>0, the user-selected erroneous trace sampling rate h_(e) can be used to sample erroneous traces of a set of traces. Alternatively, when h_(n)≤0, an alert is triggered in a GUI, such as on a monitor of a system administrator, and the normal traces are sampled with a preset normal trace sampling rate. When the normal and erroneous trace sampling rates are known, sampling is performed as described above with reference to FIG. 24 and Equations (20a)-(21b).

Suppose a set of trace data contains 260 normal traces and 170 erroneous traces. As a result, the frequency of occurrence of normal traces is approximately p_(n)=0.6 and the frequency of occurrence of erroneous traces is approximately p_(e)=0.4. When a user selects an overall sampling rate of h=0.30, the following table represents various combinations of normal and erroneous trace sampling rates that may be used:

Table of Erroneous and Normal Sampling Rates for Overall Sampling Rate of h = 0.30

h_(e)   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8      0.9
h_(n)   0.43   0.37   0.3    0.24   0.17   0.1    0.04   −0.027   −0.092

The Table shows that when a user selects erroneous trace sampling rates h_(e) less than or equal to 0.7, the corresponding normal sampling rate h_(n) is acceptable. However, when a user selects erroneous sampling rates 0.8 and 0.9, the corresponding normal sampling rates are negative valued, which triggers an alert that is displayed on a system administrator's monitor. In that case, the normal trace sampling rate may be set to a default normal trace sampling rate, such as 0.04 or 0.1. Suppose a user selects an erroneous trace sampling rate of 0.60, which, according to the Table, corresponds to a normal trace sampling rate of 0.1. These sampling rates will produce an overall sampling rate of h=0.30, which corresponds to sampling and storing 30% of the traces in the set of trace data.
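Equation (30) and the fallback behavior can be sketched in Python as follows. The function name, the returned alert flag, and the default rate of 0.04 are illustrative assumptions; the sketch uses the rounded frequencies p_(n)=0.6 and p_(e)=0.4 from the example.

```python
def normal_sampling_rate(h: float, h_e: float, p_e: float, default_h_n: float = 0.04):
    """Solve h_n * p_n + h_e * p_e = h for h_n (Equation (30)).

    Returns (h_n, alert). When the solved rate is not positive, the preset
    default is returned and alert is True so the caller can notify an operator.
    """
    p_n = 1.0 - p_e
    if p_n <= 0.0:
        raise ValueError("there are no normal traces to sample")
    h_n = (h - h_e * p_e) / p_n
    if h_n <= 0.0:
        return default_h_n, True
    return h_n, False


ok_rate, ok_alert = normal_sampling_rate(h=0.30, h_e=0.6, p_e=0.4)
fallback_rate, fallback_alert = normal_sampling_rate(h=0.30, h_e=0.8, p_e=0.4)
```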

In another implementation, the processes described above may be performed independent of user selections for the overall and erroneous trace sampling rates. In particular, normal and erroneous sampling rates may be preset and used based on certain metrics violating a corresponding threshold. For example, RED metrics, such as request rate, error rate, and duration, are associated with services in an application. When the error rate, for example, is less than 10%, 4% of normal traces are sampled and stored (i.e., h_(n)=0.04) and 1% of erroneous traces are sampled and stored (i.e., h_(e)=0.01). On the other hand, when the error rate is greater than 10%, 4% of normal traces are sampled and stored (i.e., h_(n)=0.04) and 6% of erroneous traces are sampled and stored (i.e., h_(e)=0.06).

For a user-selected sampling rate h, the modified Gini index G^((β))=h and the sampling parameter β is obtained as described below. The modified Gini index for the normal sampling rate is given by

$\begin{matrix}{G_{n}^{(\beta)} = \frac{{\overset{\_}{N}}_{n}}{N_{n}}} & ( {31a} )\end{matrix}$

The modified Gini index for the erroneous sampling rate is given by

$\begin{matrix}{G_{e}^{(\beta)} = \frac{{\overset{\_}{N}}_{e}}{N_{e}}} & ( {31b} )\end{matrix}$

The compression rate is given by Equation (23c).

Sampling Parameters

In one implementation, the GUI 1500 in FIG. 15 may include fields that enable a user to input values for the sampling parameters α and β. For example, the GUI 1500 may include fields that enable a user to define “conservative” sampling as α=β=1, “aggressive” sampling as α=β=0.5, and “super aggressive” sampling as α=β=0.25.

In another implementation, the sampling parameters are determined based on the user-selected sampling level input via the GUI in FIG. 15. The sampling rates and corresponding compression rates depend on the modified Gini index G^((γ_(i))) defined via a set of parameters γ={γ₁, . . . , γ_(τ)}, where γ_(i) ∈ γ. In the following discussion, the generalized sampling parameter γ represents α, β, or (β, α), where the sampling parameter α represents the erroneous sampling parameter α_(e) or the normal sampling parameter α_(n), and the sampling parameter β represents the erroneous sampling parameter β_(e) or the normal sampling parameter β_(n). The efficiency of a sampling rate depends on the value of the parameters γ_(i) that will produce a user-selected sampling level as described above with reference to FIG. 15. Alternatively, a user selects a sampling level that has a corresponding sampling rate and a corresponding parameter γ_(i). For example, suppose a user defines a “conservative” sampling rate as storing 15% of unsampled traces. The optimal parameter value γ₀ satisfies the following condition on the modified Gini index:

G^((γ₀))≈0.15

The parameter γ₀ is used as the sampling parameter. Suppose a user defines an “aggressive” sampling rate as storing 10% of unsampled traces. The optimal parameter value γ₁ satisfies the following condition on the modified Gini index:

G^((γ₁))≈0.10

The parameter γ₁ is used as a sampling parameter. Suppose a user defines a “super aggressive” sampling rate as storing 5% of unsampled traces. The optimal parameter value γ₂ satisfies the following condition:

G^((γ₂))≈0.05

The parameter γ₂ is used as a sampling parameter. The sampling parameter γ is optimized based on the latest historical set of trace data. When traces of an application exhibit static behavior, the set of optimal parameters γ is hard coded for long term use. In the case of an application with highly dynamic behavior, the optimal parameters γ are regularly determined.

The optimal parameters and corresponding modified Gini indices (i.e., percentage of sampled traces) may be computed in advance. When a user selects a particular sampling level, the corresponding parameter may be obtained from the predetermined relationships between the optimal parameters and the modified Gini indices (i.e., percentage of sampled traces).

FIG. 26A shows a plot of modified Gini indices versus trace-type sampling parameters β. Horizontal axis 2602 represents a range of trace-type sampling parameters β. Vertical axis 2604 represents a range of modified Gini indices. Curve 2606 represents the modified Gini index as a function of the trace-type sampling parameter β. Table I shows trace-type sampling parameters β and the corresponding modified Gini index values:

TABLE I Modified Gini index versus trace-type sampling parameter β

β         0.02    0.025   0.03   0.035   0.04   0.045   0.05   0.055   0.06   0.065   0.07
G^((β))   0.046   0.058   0.07   0.08    0.09   0.10    0.11   0.12    0.13   0.14    0.15

Relations (β, G^((β))) in Table I may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The trace-type sampling parameter β that corresponds to a modified Gini index closest to the user-selected sampling level is used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding trace-type sampling parameter 0.07 (i.e., β=0.07) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). When a user selects a sampling level of 10% (i.e., modified Gini index of 0.10), the corresponding trace-type sampling parameter 0.045 (i.e., β=0.045) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.046), the corresponding trace-type sampling parameter 0.02 (i.e., β=0.02) is retrieved from Table I and used to obtain the sampling rate, such as in Equations (2a), (2b), (21a), and (21b).
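The lookup of a sampling parameter from stored (parameter, modified Gini index) relations can be sketched as a nearest-value search. The list below copies the Table I values; the names `TABLE_I` and `parameter_for_level` are hypothetical and stand in for whatever storage and retrieval mechanism is actually used.

```python
# (beta, modified Gini index) pairs copied from Table I.
TABLE_I = [
    (0.02, 0.046), (0.025, 0.058), (0.03, 0.07), (0.035, 0.08),
    (0.04, 0.09), (0.045, 0.10), (0.05, 0.11), (0.055, 0.12),
    (0.06, 0.13), (0.065, 0.14), (0.07, 0.15),
]


def parameter_for_level(sampling_level: float, table=TABLE_I) -> float:
    """Return the parameter whose modified Gini index is closest to the level."""
    parameter, _gini = min(table, key=lambda row: abs(row[1] - sampling_level))
    return parameter


beta_15 = parameter_for_level(0.15)
beta_10 = parameter_for_level(0.10)
beta_05 = parameter_for_level(0.05)
```

The same search applies to Tables II through IV by swapping in the corresponding relations; for Table III each stored entry would hold a (β, α) pair instead of a single parameter.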

FIG. 26B shows a plot of modified Gini index versus duration-sampling parameters α. Horizontal axis 2608 represents a range of duration-sampling parameters α. Vertical axis 2610 represents a range of modified Gini indices. Curve 2612 represents the modified Gini index as a function of the duration-sampling parameter α. Table II shows duration-sampling parameters α and the corresponding modified Gini indices:

TABLE II Modified Gini index versus duration-sampling parameter α

α         0.01    0.02   0.035   0.04    0.05    0.06    0.07    0.08   0.09   0.1    0.11
G^((α))   0.015   0.03   0.051   0.059   0.072   0.086   0.099   0.11   0.13   0.14   0.15

Relations (α, G^((α))) in Table II may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The duration-sampling parameter α that corresponds to the modified Gini index closest to the user-selected sampling level is used to obtain the duration-sampling rate in Equations (12a) and (12b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding duration-sampling parameter 0.11 (i.e., α=0.11) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b). When a user selects a sampling level of 10% (i.e., closest modified Gini index is 0.099), the corresponding duration-sampling parameter 0.07 (i.e., α=0.07) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.051), the corresponding duration-sampling parameter 0.035 (i.e., α=0.035) is retrieved from Table II and used to obtain the duration-sampling rate in Equations (12a) and (12b).

FIG. 26C shows a plot of modified Gini index versus trace-type and duration-sampling parameters β and α. Axis 2614 represents a range of trace-type sampling parameters β. Axis 2616 represents a range of duration-sampling parameters α. Axis 2618 represents a range of modified Gini indices. Curve 2620 represents the modified Gini index as a function of the sampling parameters β and α. Table III shows sampling parameters β and α and the corresponding modified Gini indices:

TABLE III Modified Gini index versus sampling parameters β and α

α           0.02    0.07   0.05   0.03   0.08   0.01   0.06
β           0.01    0.01   0.02   0.03   0.03   0.04   0.04
G^((β,α))   0.049   0.1    0.1    0.1    0.15   0.1    0.15

Relations ((β, α), G^((β, α))) in Table III may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The sampling parameters β and α that correspond to a modified Gini index closest to the user-selected sampling level are used to obtain the hybrid sampling rate in Equations (17a) and (17b). For example, when a user selects a sampling level of 15% (i.e., modified Gini index of 0.15), the corresponding sampling parameters β=0.04 and α=0.06 are retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 10% (i.e., modified Gini index is 0.1), a combination of the sampling parameters β and α is retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b). Table III shows that different combinations of sampling parameters may be used for a modified Gini index of 0.1. In one implementation, when multiple combinations of sampling parameters are available, rather than using different sampling parameters, the number of different parameters may be reduced by using sampling parameters that are equal, such as α=β=0.03. When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.049), the corresponding sampling parameters β=0.01 and α=0.02 are retrieved from Table III and used to obtain the hybrid sampling rate in Equations (17a) and (17b).

FIG. 26D shows a plot of modified Gini indices versus a sampling parameter. In this example, the same value is used for the trace-type sampling parameter β and the duration-sampling parameter α (i.e., α=β). Horizontal axis 2622 represents a range of sampling parameters α=β. Vertical axis 2624 represents a range of modified Gini indices. Curve 2626 represents the modified Gini index as a function of the sampling parameter. Table IV shows sampling parameter α (i.e., α=β) and the corresponding modified Gini index values:

TABLE IV Modified Gini index versus sampling parameter α (α = β)

α           0.011   0.013   0.015   0.025   0.027   0.029   0.037   0.041   0.043
G^((α,α))   0.04    0.049   0.057   0.093   0.097   0.1     0.14    0.147   0.153

Relations (α, G^((α, α))) in Table IV may be stored in a data storage device and retrieved from the data storage device based on a user-selected sampling level. The sampling parameter α that corresponds to the modified Gini index closest to the user-selected sampling level is used to obtain the hybrid sampling rate in Equations (17a) and (17b) with α=β. For example, when a user selects a sampling level of 15% (i.e., the closest modified Gini index is 0.147), the corresponding sampling parameter α=β=0.041 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 10% (i.e., modified Gini index is 0.1), the corresponding hybrid sampling parameter α=β=0.029 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b). When a user selects a sampling level of 5% (i.e., closest modified Gini index is 0.049), the corresponding sampling parameter α=β=0.013 is retrieved from Table IV and used to obtain the hybrid sampling rate in Equations (17a) and (17b).

Optimizing Sampling Parameters

In practice, historical optimization of the sampling parameters α and β is not feasible due to the dynamic nature of applications. Instead, the compression rate of the sampling rate is monitored over a recent time window selected by a user. The duration of the time window may be one-half hour, one hour, two hours, twelve hours, or sixteen hours. The compression rate C^((γ)) is calculated for the corresponding sampling rate applied in the time window. After the compression rate has been calculated for the time window, a difference is calculated between the compression rate and a user-selected compression rate as follows:

Δ=|C^((γ))−C_(s)|   (32)

where

γ represents the sampling parameter (i.e., γ=α, γ=β, or γ=(α, β));

C^((γ)) is the compression rate of the sampling rate with sampling parameter γ of traces over the recent time period; and

C_(s) is the user-selected compression rate.

The user-selected compression rate is given by C_(s)=1−G_(s), whereG_(s) is the modified Gini index that corresponds to the user-selectedsampling level. For example, when a user selects a sampling level of15%, the modified Gini index is G_(s)=0.15, and the user-selectedcompression rate is C_(s)=0.85.

When the difference satisfies the following condition

Δ≤Th_(Opt)   (33)

where Th_(Opt) is the optimization threshold (e.g., Th_(Opt)=0.01, 0.02, or 0.05), the sampling rate is unchanged. On the other hand, when Δ>Th_(Opt), the sampling parameter of the sampling rate is adjusted using the following function:

factor (Δ)=2−exp(−10×Δ)   (34a)

where

0≤Δ≤100; and

1≤factor≤2.

The factor in Equation (34a) is used to compute an adjusted sampling parameter as follows:

γ_(adj)=factor×γ  (34b)

Alternatively, the factor in Equation (34a) is used to compute an adjusted sampling parameter as follows:

γ_(adj)=(1/factor)×γ   (34c)

The adjusted sampling parameter of Equation (34b) or (34c) replaces a previously used sampling parameter in the sampling rates described above.
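The adjustment in Equations (32) through (34c) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names and the `increase` flag are hypothetical, and the flag exists only because the text presents Equations (34b) and (34c) as alternatives without stating when each applies.

```python
import math


def adjustment_factor(delta):
    """Equation (34a): factor = 2 - exp(-10 * delta)."""
    return 2.0 - math.exp(-10.0 * delta)


def adjust_sampling_parameter(gamma, compression_rate, target_compression_rate,
                              threshold=0.02, increase=True):
    """Return the sampling parameter to use for the next time window.

    gamma                    current sampling parameter
    compression_rate         C^(gamma) measured over the recent time window
    target_compression_rate  user-selected compression rate C_s
    threshold                optimization threshold Th_Opt
    increase                 hypothetical switch: True applies Equation (34b),
                             False applies Equation (34c)
    """
    delta = abs(compression_rate - target_compression_rate)  # Equation (32)
    if delta <= threshold:                                    # Equation (33)
        return gamma                                          # unchanged
    factor = adjustment_factor(delta)
    return gamma * factor if increase else gamma / factor
```

For example, with γ=0.029, a measured compression rate of 0.80, and C_(s)=0.85, the difference Δ=0.05 exceeds Th_(Opt)=0.02, so the factor 2−exp(−0.5) is applied to γ.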

Sampling Normal and Erroneous Traces

A trace is randomly sampled based on a Bernoulli distribution, where the probability of a success (i.e., sampling the trace) is the sampling rate r and the probability of a failure (i.e., discarding the trace) is the probability 1−r, and where r represents the sampling rate associated with the trace described above. The BRBNG receives as input the sampling rate r and, based on r, randomly outputs a number 1 for a success with probability r or randomly outputs a number 0 for a failure with probability 1−r. For each trace in a set of traces, the sampling rate r is input to the BRBNG. When the BRBNG outputs a number 1, the trace is sampled by storing the trace in a data storage device. On the other hand, when the BRBNG outputs a number 0, the trace is discarded or deleted from memory or from a data storage device. Note that assignment of the values 1 and 0 may be reversed provided 0 is associated with probability of a success r and 1 is associated with probability of a failure 1−r. In an alternative implementation, a random number generator (e.g., pseudo-random number generator) is used to output a random number, R, for each trace, where 0≤R≤1. When R≤r, the trace is sampled by storing the trace in a data storage device. On the other hand, when R>r, the trace is discarded or deleted from memory or from a data storage device.
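The alternative implementation with a pseudo-random number generator can be sketched in Python as follows. The function name `sample_traces` and the use of a returned list in place of a data storage device are assumptions made for illustration.

```python
import random


def sample_traces(traces, rate, rng=None):
    """Keep each trace with probability `rate` (one Bernoulli trial per trace).

    Python's random() returns R in [0.0, 1.0); a trace is sampled when R <= rate,
    mirroring the condition R <= r in the text, and is discarded otherwise.
    """
    rng = rng or random.Random()
    kept = []
    for trace in traces:
        if rng.random() <= rate:
            kept.append(trace)   # stands in for storing the trace
        # otherwise the trace is discarded
    return kept


if __name__ == "__main__":
    sampled = sample_traces(range(10_000), rate=0.3, rng=random.Random(7))
    print(len(sampled) / 10_000)  # close to 0.3
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when checking that the fraction of sampled traces tracks the sampling rate.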

The computer-implemented methods described below with reference to FIGS. 27-39 are stored in one or more data storage devices as machine-readable instructions that, when executed by one or more processors of the computer system, such as the computer system shown in FIG. 1, sample traces of an application executed in a distributed computing system.

FIG. 27 is a flow diagram illustrating an example implementation of a “method for sampling traces of an application.” In block 2701, a set of trace data is retrieved from data storage, such as a data storage device or a buffer. In block 2702, a “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure is performed. An example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure is described below with reference to FIG. 28. In block 2703, a “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure is performed. An example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure is described below with reference to FIG. 35. In block 2704, troubleshooting is performed on the erroneous traces to identify the source of performance problems with the application. Remedial measures may be employed to correct the performance problems. For example, VMs or containers executing application components may be migrated to different hosts to increase performance. Additional VMs or containers may be started to alleviate the workloads on already existing VMs and containers. Network bandwidth may be increased to reduce latency between peer VMs. In decision block 2705, when sampling continues, the operations represented by blocks 2702-2704 are repeated.

FIG. 28 is a flow diagram illustrating an example implementation of the “determine a normal trace sampling rate and an erroneous trace sampling rate” procedure performed in block 2702. In decision block 2801, when sampling is based on trace type and/or duration, control flows to block 2802. In block 2802, a “determine sampling rates based on trace type and/or duration” procedure is performed. An example implementation of the “determine sampling rates based on trace type and/or duration” procedure is described below with reference to FIG. 29. In decision block 2803, when sampling is based on normal and erroneous traces alone, control flows to block 2804. In block 2804, a “determine normal and erroneous trace sampling rates” procedure is performed. An example implementation of the “determine normal and erroneous trace sampling rates” procedure is described below with reference to FIG. 33. In decision block 2805, when an overall sample rate and an erroneous sample rate are given by a user, control flows to block 2806. In block 2806, a “determine normal trace sampling rate” procedure is performed. An example implementation of the “determine normal trace sampling rate” procedure is described below with reference to FIG. 34. In decision block 2807, when a red metric, such as request rate, error rate, and duration, violates a threshold, control flows to block 2808. Otherwise, control flows to block 2809. In block 2809, normal and erroneous sampling rates are selected based on the red metric. In block 2810, already determined or preset normal and erroneous sampling rates continue to be used.

FIG. 29 is a flow diagram illustrating an example implementation of the “determine sampling rates based on trace type and/or duration” procedure performed in block 2802. In decision block 2901, when a user has selected trace-type sampling of the application traces, control flows to block 2902. In block 2902, a “determine trace-type sampling rates” procedure is performed. An example implementation of the “determine trace-type sampling rates” procedure is described below with reference to FIG. 31. The trace-type sampling rates (“TTSR”) are returned and used to perform sampling of the application traces in block 2703 of FIG. 27. In block 2903, a “determine hybrid-sampling rates” procedure is performed. An example implementation of the “determine hybrid-sampling rates” procedure is described below with reference to FIG. 30. Blocks 2904 and 2905 are a while loop in which a “sample traces using hybrid-sampling rates” procedure is performed on the application traces while the duration of the time spent sampling, t, in block 2907 is less than the duration of a period of time T_(p). An example implementation of the “sample traces using hybrid-sampling rates” procedure is described below with reference to FIG. 36. In block 2908, compression rates C^((β,α)) are computed according to Equations (19a) and (19b) for hybrid sampling rates in Equations (17a) and (17b). In decision block 2909, when the compression rate obtained in block 2908 satisfies the condition |C^((β,α))−C_(s)|<ε, where ε is a small user-selected positive number (e.g., 0.01, 0.05, or 0.1) and C_(s) is the user-selected compression rate, the hybrid-sampling rates obtained in block 2903 are used to perform sampling of the application traces in block 2703 of FIG. 27. Alternatively, when the compression rate obtained in block 2908 satisfies the condition |C^((α))−C_(s)|<ε, the hybrid-sampling rates (“HSR”) obtained in block 2903 are returned and used to perform sampling of the application traces in block 2703 of FIG. 27.
Otherwise, control flows to block 2910. In block 2910, an alert is displayed in a GUI, or sent in an email to an administrator or application developer, indicating the hybrid sampling failed to satisfy the user-selected compression rate. In block 2911, a “determine duration-sampling rates” procedure is performed. An example implementation of the “determine duration-sampling rates” procedure is described below with reference to FIG. 32. Blocks 2912 and 2913 are a while loop in which a “sample traces using duration-sampling rates” procedure is performed on the application traces while the duration of the time spent sampling, t, in block 2914 is less than the duration of a period of time T_(p). An example implementation of the “sample traces using duration-sampling rates” procedure is described below with reference to FIG. 38. In block 2915, compression rates C^((α)) are computed according to Equations (15a) and (15b) for duration sampling rates in Equations (12a) and (12b). In decision block 2916, when the compression rate obtained in block 2915 satisfies the condition |C^((α))−C_(s)|<ε, the duration-sampling rates (“DSR”) obtained in block 2911 are returned and used to perform sampling of the application traces in block 2703 of FIG. 27. Otherwise, control flows to block 2917. In block 2917, an alert is displayed in a GUI, or sent in an email to an administrator or application developer, indicating the duration sampling failed to satisfy the user-selected compression rate. In decision block 2918, when the condition |C^((α))−C_(s)|<|C^((β,α))−C_(s)| is satisfied, the compression rate for duration sampling is closer to the user-selected compression rate than the compression rate for hybrid sampling and the duration-sampling rates obtained in block 2911 are returned. Otherwise, the compression rate for hybrid sampling is closer to the user-selected compression rate than the compression rate for duration sampling and the hybrid-sampling rates obtained in block 2903 are returned.
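The decisions in blocks 2909, 2916, and 2918 can be sketched in Python as follows. This is a simplified illustration: the function name `choose_sampling_rates` is hypothetical, both compression rates are passed in up front (in the flow diagram the duration compression rate is computed only after hybrid sampling fails), and the alerts of blocks 2910 and 2917 are omitted.

```python
def choose_sampling_rates(c_hybrid, c_duration, c_target, eps=0.05):
    """Return "hybrid" or "duration" following decision blocks 2909, 2916, 2918.

    c_hybrid    compression rate measured for hybrid sampling
    c_duration  compression rate measured for duration sampling
    c_target    user-selected compression rate C_s
    eps         small user-selected positive tolerance
    """
    hybrid_gap = abs(c_hybrid - c_target)
    if hybrid_gap < eps:                 # block 2909: hybrid rates accepted
        return "hybrid"
    duration_gap = abs(c_duration - c_target)
    if duration_gap < eps:               # block 2916: duration rates accepted
        return "duration"
    # Block 2918: neither met the tolerance, so return whichever is closer.
    return "duration" if duration_gap < hybrid_gap else "hybrid"
```

For instance, with C_(s)=0.85 and ε=0.05, a hybrid compression rate of 0.70 and a duration compression rate of 0.83 select the duration-sampling rates.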

FIG. 30 is a flow diagram illustrating an example implementation of the “determine hybrid-sampling rates” procedure performed in block 2903. In block 3001, sampling parameters are determined based on the user-selected sampling rate as described above with reference to Table III or Table IV. In block 3002, traces are sorted according to trace type to obtain groups of traces as described above with reference to FIG. 17. A loop beginning with block 3003 repeats the operations represented by blocks 3004-3010. In block 3004, frequencies of occurrence of normal and erroneous traces in a group of traces are determined as described above with reference to Equations (16a) and (16b). In block 3005, normal and erroneous traces of the group of traces are sorted according to trace duration as described above with reference to FIG. 23. In block 3006, a normal trace histogram and an erroneous trace histogram are constructed as described above with reference to FIGS. 21-23. In block 3007, a frequency of occurrence is determined for traces in each bin of the normal trace histogram as described above with reference to FIG. 23. In block 3008, a normal hybrid sampling rate is computed for each bin of the normal trace histogram from the frequencies of occurrence as described above with reference to Equation (17a). In block 3009, a frequency of occurrence is determined for traces in each bin of the erroneous trace histogram as described above with reference to FIG. 23. In block 3010, an erroneous hybrid sampling rate is computed for each bin of the erroneous trace histogram from the frequencies of occurrence as described above with reference to Equation (17b). In decision block 3011, blocks 3004-3010 are repeated for another group of traces.

FIG. 31 is a flow diagram illustrating an example implementation of the “determine trace-type sampling rates” procedure performed in block 2902. In block 3101, a sampling parameter β is determined based on the user-selected sampling rate as described above with reference to Table I. In block 3102, traces are sorted according to trace type to obtain groups of traces as described above with reference to FIG. 17. A loop beginning with block 3103 repeats the operations represented by blocks 3104-3108. In block 3104, a group of traces is partitioned into normal traces and erroneous traces. In block 3105, a frequency of occurrence of the normal traces is determined as described above with reference to Equation (1a). In block 3106, a normal trace-type sampling rate is computed from the frequency of occurrence of traces obtained in block 3105 and the sampling parameter β according to Equation (2a). In block 3107, a frequency of occurrence of the erroneous traces is determined as described above with reference to Equation (1b). In block 3108, an erroneous trace-type sampling rate is computed from the frequency of occurrence of traces obtained in block 3107 and the sampling parameter β according to Equation (2b). In decision block 3109, blocks 3104-3108 are repeated for another group of traces.
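The grouping and partitioning steps of FIG. 31 can be sketched in Python as follows. Several things here are assumptions made for illustration: each trace is represented as a (trace type, is-erroneous) pair, the frequency of occurrence is taken to be a count divided by the total number of traces in the set, and the caller-supplied `rate_from_frequency` callable stands in for Equations (2a) and (2b), which are not reproduced in this sketch.

```python
from collections import defaultdict


def trace_type_sampling_rates(traces, rate_from_frequency):
    """Compute per-group normal and erroneous sampling rates.

    traces               iterable of (trace_type, is_erroneous) pairs
    rate_from_frequency  callable mapping a frequency of occurrence to a
                         sampling rate (placeholder for Equations (2a)/(2b))
    Returns {trace_type: {"normal": rate, "erroneous": rate}}; a kind with no
    traces in a group is left out of that group's dictionary.
    """
    traces = list(traces)
    total = len(traces)

    # Blocks 3102 and 3104: group by trace type, partition by normal/erroneous.
    counts = defaultdict(lambda: {"normal": 0, "erroneous": 0})
    for trace_type, is_erroneous in traces:
        counts[trace_type]["erroneous" if is_erroneous else "normal"] += 1

    # Blocks 3105-3108: frequency of occurrence, then sampling rate.
    rates = {}
    for trace_type, group in counts.items():
        rates[trace_type] = {
            kind: rate_from_frequency(group[kind] / total)
            for kind in ("normal", "erroneous")
            if group[kind] > 0
        }
    return rates
```

Keeping the rate formula behind a callable lets the same loop serve any of the rate definitions given earlier in the description.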

FIG. 32 is a flow diagram illustrating an example implementation of the “determine duration-sampling rates” procedure performed in block 2911. In block 3201, a sampling parameter α is determined based on the user-selected sampling rate as described above with reference to Table II. In block 3202, traces are sorted according to trace duration as described above with reference to FIG. 20. In block 3203, a histogram is constructed as described above with reference to FIG. 21. A loop beginning with block 3204 repeats the operations represented by blocks 3205 and 3206. In block 3205, a frequency of occurrence of traces in a bin of the histogram is determined as described above with reference to Equation (9). In block 3206, a duration-sampling rate is computed from the frequency of occurrence of traces obtained in block 3205 and the sampling parameter α according to Equation (10). In decision block 3207, blocks 3205 and 3206 are repeated for another bin of the histogram. In block 3208, a frequency of occurrence of traces in the lower bin is determined as described above with reference to Equation (11a). In block 3209, a short duration sampling rate is computed from the frequency of occurrence obtained in block 3208 and the sampling parameter α according to Equation (12a). In block 3210, a frequency of occurrence of traces in the upper bin is determined as described above with reference to Equation (11b). In block 3211, a long duration sampling rate is computed from the frequency of occurrence obtained in block 3210 and the sampling parameter α according to Equation (12b).

FIG. 33 is a flow diagram illustrating an example implementation of the “determine normal and erroneous trace sampling rates” procedure in block 2804. In block 3301, the set of trace data is partitioned into normal traces and erroneous traces as described above with reference to FIG. 24. In block 3302, a frequency of occurrence is determined as described above with reference to Equation (20a). In block 3303, a normal trace sampling rate is determined as described above with reference to Equation (21a). In block 3304, a frequency of occurrence is determined as described above with reference to Equation (20b). In block 3305, an erroneous trace sampling rate is determined as described above with reference to Equation (21b).

FIG. 34 is a flow diagram illustrating an example implementation of the “determine normal trace sampling rate” procedure in block 2806. In block 3401, the set of trace data is partitioned into normal traces and erroneous traces as described above with reference to FIG. 24. In block 3402, a frequency of occurrence of normal traces is determined as described above with reference to Equation (20a). In block 3403, a frequency of occurrence of erroneous traces is determined as described above with reference to Equation (20b). In block 3404, given an overall sampling rate and an erroneous trace sampling rate, a normal trace sampling rate is determined as described above with reference to Equation (30). In decision block 3405, when the normal trace sampling rate is less than zero, control flows to block 3406. In block 3406, the normal trace sampling rate is set to a default sampling rate.
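The steps of FIG. 34 can be sketched in Python as follows. The relation used here is an assumption made for illustration and is not quoted from Equation (30): the overall sampling rate is taken to be the frequency-weighted mix of the normal and erroneous sampling rates, and that relation is solved for the normal rate. The function name, the `default_rate` value, and the clamp to 1.0 are likewise hypothetical.

```python
def normal_trace_sampling_rate(n_normal, n_erroneous, overall_rate,
                               erroneous_rate, default_rate=0.01):
    """Derive a normal trace sampling rate from an overall and an erroneous rate.

    Assumed relation (not quoted from the text):
        overall_rate = f_normal * r_normal + f_erroneous * erroneous_rate
    where f_normal and f_erroneous are the frequencies of occurrence of
    normal and erroneous traces in the set.
    """
    total = n_normal + n_erroneous
    if n_normal == 0:
        return default_rate              # no normal traces to weight against
    f_normal = n_normal / total          # frequency of normal traces
    f_erroneous = n_erroneous / total    # frequency of erroneous traces
    rate = (overall_rate - f_erroneous * erroneous_rate) / f_normal
    if rate < 0:                         # blocks 3405-3406: fall back to default
        return default_rate
    return min(rate, 1.0)                # added safeguard: keep a valid probability
```

A negative result arises when the erroneous traces alone already exceed the overall sampling budget, which is the case the default sampling rate of block 3406 covers.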

FIG. 35 is a flow diagram illustrating an example implementation of the “sample normal traces using the normal trace sampling rate and the erroneous trace sampling rate” procedure performed in block 2703. In decision block 3501, when hybrid sampling has been selected, control flows to block 3502. In block 3502, the “sample traces using hybrid-sampling rates” procedure described below in FIG. 36 is performed. In decision block 3503, when trace-type sampling has been selected, control flows to block 3504. In block 3504, the “sample traces using trace-type sampling rates” procedure described below in FIG. 37 is performed. In decision block 3505, when duration sampling has been selected, control flows to block 3506. In block 3506, the “sample traces using duration-sampling rates” procedure described below in FIG. 38 is performed. In block 3507, a “sample traces using normal and erroneous trace sampling rates” procedure is performed. An example implementation of the “sample traces using normal and erroneous trace sampling rates” procedure is described below with reference to FIG. 39. In block 3508, a compression rate that corresponds to the sampling rate is computed over a time window. In decision block 3509, when Δ>Th_(Opt), control flows to block 3510 and the sampling parameter is adjusted as described above with reference to Equation (34b) or Equation (34c).

FIG. 36 is a flow diagram illustrating an example implementation of the “sample traces using hybrid-sampling rates” procedure performed in block 3502. This procedure is performed in block 3502 for normal traces and for erroneous traces. A loop beginning with block 3601 repeats the operations represented by blocks 3602-3608 for each group of traces. A loop beginning with block 3602 repeats the operations represented by blocks 3603-3607 for each bin of the normal trace histogram and each bin of the erroneous trace histogram obtained in block 3006 of FIG. 30. A loop beginning with block 3603 repeats the operations represented by blocks 3604-3606 for each trace in the bin. In block 3604, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3605, when output of the BRBNG is a success, control flows to block 3606. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3606, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3607, the operations represented by blocks 3604-3606 are repeated for the traces in the bin. In decision block 3608, the operations represented by blocks 3603-3607 are repeated for another bin. In decision block 3609, the operations represented by blocks 3602-3608 are repeated for another group of traces.
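The nested loops of FIG. 36 can be sketched in Python as follows. The input layout (a dictionary of groups, each holding lists of (rate, traces) bins for normal and erroneous traces), the function name, and the in-memory lists that stand in for the normal and erroneous trace databases are all assumptions made for illustration.

```python
import random


def sample_with_hybrid_rates(groups, rng=None):
    """Sample traces group by group, bin by bin, trace by trace.

    groups: {trace_type: {"normal": [(rate, [trace, ...]), ...],
                          "erroneous": [(rate, [trace, ...]), ...]}}
    Each (rate, traces) pair is one histogram bin with its hybrid sampling rate.
    Returns {"normal": [...], "erroneous": [...]} holding the sampled traces.
    """
    rng = rng or random.Random()
    stores = {"normal": [], "erroneous": []}      # stand-ins for the databases
    for histograms in groups.values():            # loop over groups of traces
        for kind in ("normal", "erroneous"):
            for rate, traces in histograms.get(kind, []):   # loop over bins
                for trace in traces:                        # loop over traces
                    # Bernoulli trial: success with probability `rate`.
                    if rng.random() < rate:
                        stores[kind].append(trace)
                    # on a failure the trace is simply discarded
    return stores
```

Because the rate is attached to the bin, every trace in a bin shares one sampling rate, while normal and erroneous traces are routed to separate stores.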

FIG. 37 is a flow diagram illustrating an example implementation of the “sample traces using trace-type sampling rates” procedure performed in block 3504. This procedure is performed in block 3504 for normal traces and for erroneous traces. A loop beginning with block 3701 repeats the operations represented by blocks 3702-3706 for each group of traces. A loop beginning with block 3702 repeats the operations represented by blocks 3703-3705 for each trace in the group. In block 3703, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace given by Equation (2a) or Equation (2b). In decision block 3704, when output of the BRBNG is a success, control flows to block 3705. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3705, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3706, the operations represented by blocks 3703-3705 are repeated for each of the traces in the group. In decision block 3707, the operations represented by blocks 3702-3706 are repeated for another group of traces.

FIG. 38 is a flow diagram illustrating an example implementation of the “sample traces using duration-sampling rates” procedure performed in block 3506. This procedure is performed in block 3506 for normal traces and for erroneous traces. A loop beginning with block 3801 repeats the operations represented by blocks 3802-3806 for each bin of the histogram obtained in FIG. 22. A loop beginning with block 3802 repeats the operations represented by blocks 3803-3805 for each trace in the bin. In block 3803, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3804, when output of the BRBNG is a success, control flows to block 3805. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3805, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3806, the operations represented by blocks 3803-3805 are repeated for each of the traces in the bin. In decision block 3807, the operations represented by blocks 3802-3806 are repeated for another bin.

FIG. 39 is a flow diagram illustrating an example implementation of the “sample traces using normal and erroneous trace sampling rates” procedure performed in block 3507. This procedure is performed in block 3507 for normal traces and for erroneous traces. A loop beginning with block 3901 repeats the operations represented by blocks 3902-3904 for each trace in the normal traces and again for each trace in the erroneous traces. In block 3902, a success (e.g., “1”) or a failure (e.g., “0”) is computed with the BRBNG for the sampling rate associated with the trace. In decision block 3903, when output of the BRBNG is a success, control flows to block 3904. Otherwise, the output of the BRBNG is a failure and the trace is discarded. In block 3904, the trace is stored in a normal trace database for the application when the trace is a normal trace and is stored in an erroneous trace database when the trace is an erroneous trace. The normal and erroneous databases are persisted in a data storage device. In decision block 3905, the operations represented by blocks 3902-3904 are repeated for each of the traces in the normal traces and the erroneous traces of the set of trace data. In decision block 3905, the operations represented by blocks 3902-3904 are repeated for another group of traces.

It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

1. A method stored in one or more data storage devices and executed using one or more processors of a computer system for sampling a set of traces of an application executed in a distributed computing system, the method comprising: retrieving a set of trace data associated with the application from a data storage device; determining sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set, wherein the different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces; sampling the traces using the sampling rates to obtain sampled normal traces and sampled erroneous traces, wherein less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces; and storing the sampled traces in a data storage device.
 2. The method of claim 1 wherein determining the sampling rates comprises: sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; and for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group, determining a frequency of occurrence of erroneous traces in the group, constructing a normal trace histogram of the normal traces, constructing an erroneous trace histogram of the erroneous traces, determining a frequency of occurrence of normal traces in each bin of the normal trace histogram, determining a frequency of occurrence of normal traces in each bin of the erroneous trace histogram, determining a normal hybrid sampling rate for each bin of the normal histogram based on the frequency of occurrence of normal traces in each bin, the frequency of occurrence of the normal traces, and determining an erroneous hybrid sampling rate for each bin of the normal histogram based on the frequency of occurrence of normal traces in each bin, the frequency of occurrence of the normal traces.
 3. The method of claim 1 wherein determining the sampling rates comprises: sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; receiving a sampling level via a graphical user interface; determining a trace-type sampling parameter based on the user-selected sampling level; and for each group of traces, partitioning the group of traces into normal traces and erroneous traces; determining a frequency of occurrence of normal traces in the group of traces, determining a normal trace-type sampling rate based on the frequency of occurrence of normal traces, determining a frequency of occurrence of erroneous traces in the group of traces, determining an erroneous trace-type sampling rate based on the frequency of occurrence of erroneous traces.
 4. The method of claim 1 wherein determining the sampling rates comprises: constructing a histogram of traces based on the durations, each bin of the histogram corresponding to a time interval and containing traces with durations in the time interval; determining a frequency of occurrence of normal traces in each bin of the histogram; for each bin of the histogram, determining a duration-sampling rate based on the frequency of occurrence of traces in the bin and the duration-sampling parameter; determining a frequency of occurrence of traces in a lower bin; computing a short duration sampling rate from the frequency of occurrence of traces in the lower bin; determining a frequency of occurrence of traces in an upper bin; and computing a long duration sampling rate from the frequency of occurrence of traces in the upper bin.
 5. The method of claim 1 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a normal trace sampling rate based on the frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.
 6. The method of claim 1 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, frequency of occurrence of the erroneous traces, an overall sampling rate, and erroneous trace sampling rate.
 7. The method of claim 1 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.
 8. The method of claim 1 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.
 9. The method of claim 1 further comprises: performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and executing remedial measures to correct the performance problem.
 10. A computer system for sampling application traces of anapplication executed in a distributed computer system, the systemcomprising: one or more processors: one or more data storage devices:and machine-readable instructions stored in the one or more data storagedevices that when executed using the one or more processors controls thesystem to perform operations comprising: retrieving a set of trace dataassociated with the application from a data storage device; determiningsampling rates for sampling normal traces in the set and for samplingerroneous traces in the set, wherein the different sampling rates areinversely proportional to the frequency of occurrence of the normaltraces and erroneous traces: sampling the traces using the samplingrates to obtain sampled normal traces and sampled erroneous traces,wherein less frequently occurring normal traces are sampled at highersampling rates than more frequently occurring normal traces and lessfrequently occurring erroneous traces are sampled at higher samplingrates than more frequently occurring erroneous traces; and storing thesampled traces in a data storage device.
 11. The computer system ofclaim 10 wherein determining the sampling rates comprises: sorting thetraces according to trace type to obtain one or more groups of traces,each group of traces having a different associated trace type; and foreach group of traces, partitioning the group of traces into normaltraces and erroneous traces, determining a frequency of occurrence ofnormal traces in the group, determining a frequency of occurrence oferroneous traces in the group constructing a normal trace histogram ofthe normal traces, constructing an erroneous trace histogram of theerroneous traces, determining a frequency of occurrence of normal tracesin each bin of the normal trace histogram, determining a frequency ofoccurrence of normal traces in each bin of the erroneous tracehistogram, determining a normal hybrid sampling rate for each bin of thenormal histogram based on the frequency of occurrence of normal tracesin each bin, the frequency of occurrence of the normal traces, anddetermining an erroneous hybrid sampling rate for each bin of the normalhistogram based on the frequent of occurrence of normal traces in eachbin, the frequency of occurrence of the normal traces.
 12. The computersystem of claim 10 wherein determining the sampling rates comprises:sorting the traces according to trace type to obtain one or more groupsof traces, each group of traces having a different associated tracetype; receiving a sampling level via a graphical user interface;determining a trace-type sampling parameter based on the user-selectedsampling level; and for each group of traces, partitioning the group oftraces into normal traces and erroneous traces; determining a frequencyof occurrence of normal traces in the group of traces, determining anormal trace-type sampling rate based on the frequency of occurrence ofnormal traces, determining a frequency of occurrence of erroneous tracesin the group of traces, determining an erroneous trace-type samplingrate based on the frequency of occurrence of erroneous traces.
 13. Thecomputer system of claim 10 wherein determining the sampling ratescomprises: constructing a histogram of traces based on the durations,each bin of the histogram corresponding to a time interval andcontaining traces with durations in the time interval; determining afrequency of occurrence of normal traces in each bin of the histogram:for each bin of the histogram, determining a duration-sampling ratebased on the frequency of occurrence of traces in the bin and theduration-sampling parameter; determining a frequency of occurrence oftraces in a lower bin; computing a short duration sampling rate from thefrequency of occurrence of traces in the lower bin; determining afrequency of occurrence of traces in an upper bin; and computing a longduration sampling rate from the frequency of occurrence of traces in theupper bin.
 14. The computer system of claim 10 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a normal trace sampling rate based on the frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.
 15. The computer system of claim 10 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, frequency of occurrence of the erroneous traces, an overall sampling rate, and erroneous trace sampling rate.
 16. The computer system of claim 10 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.
 17. The computer system of claim 10 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.
 18. The computer system of claim 10 further comprises: performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and executing remedial measures to correct the performance problem.
 19. A non-transitory computer-readable medium encoded with machine-readable instructions that when executed by one or more processors of a computer system perform operations comprising: retrieving a set of trace data associated with the application from a data storage device; determining sampling rates for sampling normal traces in the set and for sampling erroneous traces in the set, wherein the different sampling rates are inversely proportional to the frequency of occurrence of the normal traces and erroneous traces; sampling the traces using the sampling rates to obtain sampled normal traces and sampled erroneous traces, wherein less frequently occurring normal traces are sampled at higher sampling rates than more frequently occurring normal traces and less frequently occurring erroneous traces are sampled at higher sampling rates than more frequently occurring erroneous traces; and storing the sampled traces in a data storage device.
 20. The medium of claim 19 wherein determining the sampling rates comprises: sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; and for each group of traces, partitioning the group of traces into normal traces and erroneous traces, determining a frequency of occurrence of normal traces in the group, determining a frequency of occurrence of erroneous traces in the group, constructing a normal trace histogram of the normal traces, constructing an erroneous trace histogram of the erroneous traces, determining a frequency of occurrence of normal traces in each bin of the normal trace histogram, determining a frequency of occurrence of normal traces in each bin of the erroneous trace histogram, determining a normal hybrid sampling rate for each bin of the normal histogram based on the frequency of occurrence of normal traces in each bin, the frequency of occurrence of the normal traces, and determining an erroneous hybrid sampling rate for each bin of the normal histogram based on the frequency of occurrence of normal traces in each bin, the frequency of occurrence of the normal traces.
 21. The medium of claim 19 wherein determining the sampling rates comprises: sorting the traces according to trace type to obtain one or more groups of traces, each group of traces having a different associated trace type; receiving a sampling level via a graphical user interface; determining a trace-type sampling parameter based on the user-selected sampling level; and for each group of traces, partitioning the group of traces into normal traces and erroneous traces; determining a frequency of occurrence of normal traces in the group of traces, determining a normal trace-type sampling rate based on the frequency of occurrence of normal traces, determining a frequency of occurrence of erroneous traces in the group of traces, determining an erroneous trace-type sampling rate based on the frequency of occurrence of erroneous traces.
 22. The medium of claim 19 wherein determining the sampling rates comprises: constructing a histogram of traces based on the durations, each bin of the histogram corresponding to a time interval and containing traces with durations in the time interval; determining a frequency of occurrence of normal traces in each bin of the histogram; for each bin of the histogram, determining a duration-sampling rate based on the frequency of occurrence of traces in the bin and the duration-sampling parameter; determining a frequency of occurrence of traces in a lower bin; computing a short duration sampling rate from the frequency of occurrence of traces in the lower bin; determining a frequency of occurrence of traces in an upper bin; and computing a long duration sampling rate from the frequency of occurrence of traces in the upper bin.
 23. The medium of claim 19 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a normal trace sampling rate based on the frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining an erroneous trace sampling rate based on the frequency of occurrence of the erroneous traces.
 24. The medium of claim 19 wherein determining the sampling rates comprises: partitioning the set of trace data into normal traces and erroneous traces; determining a frequency of occurrence of the normal traces; determining a frequency of occurrence of the erroneous traces; and determining a normal trace sampling rate based on the frequency of occurrence of the normal traces, frequency of occurrence of the erroneous traces, an overall sampling rate, and erroneous trace sampling rate.
 25. The medium of claim 19 wherein sampling the traces using the sampling rates comprises sampling normal traces with a normal trace sampling rate, wherein the normal trace sampling rate is inversely proportional to a frequency of occurrence of the normal traces.
 26. The medium of claim 19 wherein sampling the traces using the sampling rates comprises sampling erroneous traces with an erroneous trace sampling rate, wherein the erroneous trace sampling rate is inversely proportional to a frequency of occurrence of the erroneous traces.
 27. The medium of claim 19 further comprises: performing troubleshooting on the sampled erroneous traces to identify a performance problem with the application; and executing remedial measures to correct the performance problem.
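
By way of illustration only, and without limiting the claims, the partitioning and inverse-frequency sampling recited above may be sketched as follows. The sketch assumes a rate of the form min(1, k / frequency); the constant `k`, the function names, and the `error` field of each trace record are hypothetical and are not taken from the disclosure.

```python
import random
from collections import Counter


def inverse_frequency_rates(labels, k=0.05):
    """Return a sampling rate per label, inversely proportional to the
    label's frequency of occurrence and capped at 1.0.

    k is a hypothetical tuning constant, not a value from the disclosure.
    """
    counts = Counter(labels)
    total = len(labels)
    return {label: min(1.0, k / (n / total)) for label, n in counts.items()}


def sample_traces(traces, k=0.05, seed=0):
    """Partition traces into normal and erroneous, then keep each trace
    with the probability assigned to its partition."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    labels = ["erroneous" if t["error"] else "normal" for t in traces]
    rates = inverse_frequency_rates(labels, k)
    sampled = [t for t, lab in zip(traces, labels) if rng.random() < rates[lab]]
    return rates, sampled


# Synthetic data: 1000 traces, of which every 20th is erroneous (5%).
traces = [{"id": i, "error": i % 20 == 0} for i in range(1000)]
rates, sampled = sample_traces(traces)
```

In this synthetic data set the erroneous traces are the less frequent partition, so they receive the larger sampling rate, while the more frequent normal traces are sampled at a smaller rate.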